SUMMARY. Recent discussion of the success of feature selection
methods has argued that focusing on a relatively small number of
features has been counterproductive. Instead, it is suggested, the number
of significant features can be in the thousands or tens of thousands,
rather than (as is commonly supposed at present) approximately in
the range from five to fifty. This change, in orders of magnitude, in
the number of influential features, necessitates alterations to the way
in which we choose features and to the manner in which the success
of feature selection is assessed. In this paper we suggest a general
approach that is suited to cases where the number of relevant features is
very large, and we consider particular versions of the approach in
detail. We propose ways of measuring performance, and we study both
theoretical and numerical properties of the proposed methodology.
1 INTRODUCTION

In this paper we develop statistical methods for determining features that enable ef- 
fective discrimination between two populations of very high dimensional data, when 
the number of component-wise differences that provide leverage for discrimination is 
relatively large but the sizes of those differences are potentially small. By way of con- 
trast, conventional approaches to solving this problem tend to rely on relatively large 
differences and relatively small numbers of components where differences occur. 
In such problems it is generally going to be of substantial practical interest to
identify, with reasonable accuracy, the components that have greatest leverage for correct
discrimination. Simply constructing a classifier, which might depend in a
difficult-to-determine way on differences between two populations, is generally not going to
provide all the information that is sought. However, particularly when the number 
of such components is large, we may not be able to identify the components without 
error. How accurate can we be, and in what circumstances is accuracy high? In this 
paper we shall endeavour to answer these questions. 

Achieving reasonable accuracy can involve relatively computer-intensive methods,
for example algorithms that need $O(p^2)$ rather than $O(p)$ time if the problem is
p-dimensional. However, if we use an initial, deterministic dimension reduction step,
which decreases dimension to q where $q \ll p$, then $O(p^2)$ calculations can be reduced
to $O(p \log p + q^2)$, where $p \log p$ is the computational cost of ordering the initial p
components. In many cases we expect q to be a rather crude upper bound to the true
number, r say, of components that impact on performance of the classifier. The
four-stage algorithm that we shall introduce in section 2 enables us to reduce computational
expense from $O(p \log p + q^2)$ to $O(p \log p + r^2)$. (These order-of-magnitude calculations
ignore the effects of training sample size, n say, since in the problems we are considering
n is typically much less than p, q or r and so has relatively little impact on the final
result.)
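
To make these orders of magnitude concrete, the following minimal sketch tabulates the three operation counts for one hypothetical choice of p, q and r. The sizes and the function names `cost_full` and `cost_reduced` are illustrative assumptions; constants hidden by the O-notation, and the dependence on n, are ignored.

```python
import math

def cost_full(p):
    """Operation count of order p^2 (no dimension reduction)."""
    return p ** 2

def cost_reduced(p, m):
    """Operation count of order p log p + m^2: order p components, then work with m."""
    return p * math.log(p) + m ** 2

# Hypothetical sizes, chosen only for illustration.
p, q, r = 10**6, 10**4, 10**3
for name, value in [("p^2", cost_full(p)),
                    ("p log p + q^2", cost_reduced(p, q)),
                    ("p log p + r^2", cost_reduced(p, r))]:
    print(f"{name:>15}: {value:.3g}")
```

For sizes like these the $q^2$ term dominates the cost after deterministic reduction, whereas after adaptive reduction the ordering cost $p \log p$ is the larger term.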

Support for the conjecture that r can be quite large, for example in genomics prob- 
lems, has been given by Goldstein (2009), who, in the words of J.N. Hirschhorn in 
the same issue of the New England Journal of Medicine, "builds a speculative math- 
ematical model and infers that there will be tens of thousands of common variants 
influencing each disease and trait" (Hirschhorn, 2009). Goldstein's (2009) calculations 
are also consistent with r being in the thousands, not just the tens of thousands: 

. . . the genetic burden of common diseases must be mostly carried by large 
numbers of rare variants. In this theory, schizophrenia, say, would be caused 
by combinations of 1,000 rare genetic variants, not of 10 common genetic 
variants. 

(See Wade, 2009.) Kraft and Hunter (2009) argue that "many, rather than few, vari- 
ant risk alleles are responsible for the majority of the inherited risk of each common 
disease." Again, r is large rather than small. 
In the discussion above one might interpret p and r as representing numbers of
single nucleotide polymorphisms (SNPs), alleles, or perhaps genes. There are believed 
to be between 10 and 30 million SNPs on a human chromosome, and some 25,000 
genes. However, genomic analyses based on decoding the full DNA of individuals who 
suffer from specific conditions (Wade, 2009) increase the values of both p and r by 
orders of magnitude. (In practice r will be chosen empirically, and so will actually be 
a function of the data, but at the level of the discussion in the present section there is 
little to be gained by making this distinction.) 

Methods for feature selection based on the linear model are generally considered 
only in cases where the number of features is relatively small. Otherwise, the value of 
the response variable can be unreasonably insensitive to changes in a single feature. 
Examples of approaches founded on the linear model include the nonnegative garrotte 
(e.g. Breiman, 1995, 1996; Gao, 1998), the lasso (Tibshirani, 1996), the Dantzig selector 
(Candes and Tao, 2007), and related techniques (e.g. Donoho and Huo, 2001; Fan and 
Li, 2001, 2006; Donoho and Elad, 2003; Tropp, 2005; Donoho, 2006a, 2006b; Fan 
and Ren, 2006; Fan and Fan, 2008). The feature-ranking approach that we consider is 
more closely related to correlation-based approaches of Fan and Lv (2008) and Hall and 
Miller (2008), but it does not assume the existence of a response variable. Instead
it utilises class labels via a logistic model. Monograph-length treatments of classifiers 
and related methodology include those of Duda et al. (2001), Hastie et al. (2001) and 
Shakhnarovich et al. (2005). 

Section 2.1, immediately below, proposes a general algorithm for determining fea- 
tures that appear to have significant influence on whether a data vector comes from 
one population or another. Sections 2.2 and 2.3 discuss particular approaches to im- 
plementing the algorithm, and section 2.4 addresses computational labour. Section 3 
develops theoretical properties of the component ranking stage in the algorithm, under 
the assumption that there is a large number of relatively small differences among com- 
ponent means. Section 4 explores properties of the adaptive dimension reduction stage, 
section 5 discusses numerical properties, and section 6 outlines technical arguments. 

2 METHODOLOGY 

2.1. Data and algorithm. Denote the two populations of interest by $\Pi_0$ and $\Pi_1$.
Training data from each are acquired as p-vectors $X_i = (X_{i1}, \ldots, X_{ip})$, for $1 \le i \le n$.
We also record the value of a label $I_i$, for each i; it equals 0 or 1, indicating the index
of the population from which $X_i$ came.

One potential algorithm for identifying indices j, of vector components which
capture differences between $\Pi_0$ and $\Pi_1$, has four stages, (1)-(4) below. The goal of the
algorithm is to determine empirically a set $\{j_1, \ldots, j_r\}$ of indices, a subset of $\{1, \ldots, p\}$,
such that the features with indices $j_1, \ldots, j_r$ have significant influence on whether $X_i$
comes from $\Pi_0$ or $\Pi_1$. Those features can then be combined into a classifier, for example
the support vector machine or a centroid-based method, to effect discrimination.

(1) Component ranking. Using a method such as that suggested in section 2.2, rank
all components in terms of their individual influence on $I_i$, interpreted as a zero-one
response variable. This stage takes $O(p \log p)$ time to run, and produces a permutation
$j_1, \ldots, j_p$, say, of $1, \ldots, p$, where the order of the sequence $j_1, \ldots, j_p$ is of major
importance and signifies that, for each k, the component with index $j_k$ has greater leverage
than the component with index $j_{k+1}$ on a measure of our ability to predict $I_i$ from $X_i$.

(2) Deterministic dimension reduction. Truncate at q (where $1 \le q \le p$) the sequence
we derived in step (1). From this point we work only with q-vectors comprised of the
components with indices $j_1, \ldots, j_q$. The value of q is determined largely by our
computational resources, bearing in mind that the computational expense of constructing
the classifier could be as high as $O(q^2)$.

(3) Adaptive dimension reduction. In this stage we use an empirical method to reduce
dimension from q, chosen in stage (2), to r, so that the final choice of feature indices is
$j_1, \ldots, j_r$. Potential approaches are discussed in section 2.3, and include methods based
on: (3a) thresholding, (3b) change-point methods, or (3c) application of classifiers to
blocks of components.

(4) Backing and filling. In practice it can be advantageous to rerun stage (3) of the
algorithm using several of the values of j chosen early in stage (2), or early in the
implementation of stage (3), bearing in mind that there is potential for noise in the
choice of $j_1$, for example, to throw the algorithm off course for a period. At this point
we could, for example, experiment with different choices of block size in method (3c).

2.2. Method for ranking components. Given an index j between 1 and p, and scalar
parameters $\alpha$ and $\beta$, we capture the relationship between $I_i$ and $X_{ij}$ by assuming a
logit model:

    $P(I_i = 0 \mid X_{ij}) = \{1 + \exp(\alpha + \beta X_{ij})\}^{-1}$ ,   $P(I_i = 1 \mid X_{ij}) = 1 - P(I_i = 0 \mid X_{ij})$ .

The likelihood of $I_i$, given $X_{ij}$, is

    $L_{ij}(I_i \mid \alpha, \beta; X_{ij}) = t_{ij}^{I_i} / (1 + t_{ij})$ ,

where $t_{ij} = \exp(\alpha + \beta X_{ij})$. Therefore the negative log-likelihood is

    $\ell_{ij}(\alpha, \beta) = -\log L_{ij}(I_i \mid \alpha, \beta; X_{ij}) = -I_i (\alpha + \beta X_{ij}) + \log\{1 + \exp(\alpha + \beta X_{ij})\}$ ,   (2.1)

and its counterpart for $X_{1j}, \ldots, X_{nj}$ is

    $\ell_j(\alpha, \beta) = \frac{1}{n} \sum_{i=1}^{n} \ell_{ij}(\alpha, \beta)$ .   (2.2)

Define $(\hat\alpha_j, \hat\beta_j)$ to be the value of $(\alpha, \beta)$ that minimises $\ell_j(\alpha, \beta)$, and put

    $\mathcal{L}_j = \ell_j(\hat\alpha_j, \hat\beta_j)$ .   (2.3)

The ordering $j_1, \ldots, j_p$ mentioned in step (1) of the algorithm in section 2.1 is
determined by the values of $\mathcal{L}_j$. Specifically, $\mathcal{L}_{j_1} \le \ldots \le \mathcal{L}_{j_p}$.
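
The ranking step can be sketched numerically as follows. This is a minimal illustration, not a definitive implementation: it assumes the criterion is the minimised average negative log-likelihood of the one-feature logit model, it uses Newton's method as the optimiser (a choice made here, not one specified in the text), and the names `logit_criterion` and `rank_features` are hypothetical.

```python
import numpy as np

def logit_criterion(x, labels, n_iter=25):
    """Minimised average negative log-likelihood of the one-feature logit model.

    Fits P(I = 1 | x) = 1 / (1 + exp(-(alpha + beta * x))) by Newton's method
    (an assumed optimiser) and returns the attained criterion value.
    """
    n = x.shape[0]
    design = np.column_stack([np.ones(n), x])
    theta = np.zeros(2)                      # (alpha, beta), started at zero
    for _ in range(n_iter):
        eta = design @ theta
        prob = 1.0 / (1.0 + np.exp(-eta))    # fitted P(I = 1 | x)
        grad = design.T @ (prob - labels) / n
        weights = prob * (1.0 - prob)
        hess = (design * weights[:, None]).T @ design / n
        theta = theta - np.linalg.solve(hess, grad)
    eta = design @ theta
    # log(1 + exp(eta)) is computed stably as logaddexp(0, eta).
    return float(np.mean(np.logaddexp(0.0, eta) - labels * eta))

def rank_features(X, labels):
    """Return (order, criteria); order[0] is the index with the smallest criterion."""
    criteria = np.array([logit_criterion(X[:, j], labels)
                         for j in range(X.shape[1])])
    return np.argsort(criteria, kind="stable"), criteria
```

Each fit involves only two parameters, so the work per feature is small; the sort over p criterion values supplies the $p \log p$ ordering cost quoted for stage (1). The sketch does not guard against a feature that separates the two classes perfectly, in which case the Newton iteration would not converge.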

2.3. Methods for adaptive dimension reduction. Several approaches are feasible,
including the following.

(3a) Thresholding. Here we compute, from the data, a subsidiary criterion
$\tilde{\mathcal{L}}_j$, for $1 \le j \le p$ (we might simply choose $\tilde{\mathcal{L}}_j = 0$) and a threshold t; we take $k_0 > 0$
to be an integer; and we define $r \in [k_0 + 1, q]$, a function of the data, to be the least
integer in that range such that $\mathcal{L}_{j_{r+k}} - \tilde{\mathcal{L}}_{j_{r+k}} > t$ for $1 \le k \le k_0$. See section 4 for an
example.

(3b) Change-point methods. Here we look for a change-point in the sequence
$\mathcal{L}_{j_1}, \ldots, \mathcal{L}_{j_p}$, and we take $j_r$ to be the location of that point. (There is a vast literature on
methodology and theory for change-point detection. It includes book-length accounts
by Carlstein et al. (1994), Csörgő and Horváth (1997), Chen and Gupta (2000) and
Wu (2005).)

(3c) Application of classifiers. For $k \ge 1$, let $B_k = \{j_{(k-1)b+1}, \ldots, j_{kb}\}$
denote the kth block of feature indices; here, b denotes block length. (Theoretical
considerations suggest that taking $b \sim \mathrm{const.}\, n$ is appropriate.) In step s of stage (3c) we
construct the classifier that is based on the training data vectors where all but the
components with indices in $\bigcup_{1 \le k \le s} B_k$ have been stripped away. We use cross-validation to
measure classifier performance, and in this way we determine whether progressing from
step s to step s + 1 gives an improvement. If it does not then, subject to the "jiggling"
suggested in stage (4), we stop at step s. If performance is improved by passing to
the (s + 1)st block then we proceed to step s + 1, where we again assess performance.
However, this approach can be biased in favour of low apparent error rate, without
having the same impact on actual error rate; see section 5.3 for discussion.
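
Method (3c) can be sketched as a forward search over blocks. The following is a minimal illustration under stated assumptions: the classifier is a nearest-centroid rule and performance is measured by K-fold cross-validated error, both chosen here for simplicity, and the names `centroid_cv_error` and `blockwise_selection` are hypothetical. The same folds are reused at every step so that successive error rates are directly comparable.

```python
import numpy as np

def centroid_cv_error(X, labels, n_folds=5, seed=0):
    """K-fold cross-validated error rate of a nearest-centroid classifier."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    folds = np.array_split(rng.permutation(n), n_folds)
    errors = 0
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        c0 = X[train & (labels == 0)].mean(axis=0)   # centroid of population 0
        c1 = X[train & (labels == 1)].mean(axis=0)   # centroid of population 1
        d0 = np.sum((X[test_idx] - c0) ** 2, axis=1)
        d1 = np.sum((X[test_idx] - c1) ** 2, axis=1)
        pred = (d1 < d0).astype(int)
        errors += int(np.sum(pred != labels[test_idx]))
    return errors / n

def blockwise_selection(X, labels, order, b, q):
    """Add blocks of b ranked features until the CV error stops improving.

    `order` is the ranking from stage (1); at most q features are considered.
    Returns (r, best_error), where r = s * b for the final step s.
    """
    n_blocks = q // b
    if n_blocks < 1:
        raise ValueError("q must be at least the block length b")
    best = centroid_cv_error(X[:, order[:b]], labels)
    s = 1
    while s < n_blocks:
        err = centroid_cv_error(X[:, order[:(s + 1) * b]], labels)
        if err >= best:          # no improvement: stop at step s
            break
        best, s = err, s + 1
    return s * b, best
```

Because the ranking and the cross-validation use the same training data, the returned error rate is an apparent error and is subject to the optimistic bias noted at the end of the description of (3c).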

2.4. Duration of algorithm. Stage (3) of the algorithm takes $O(r^2)$ time to complete,
where r, in the range $1 \le r \le q$, denotes the final number of components on which we
determine that the classifier should depend. In particular, the algorithm concludes with
a list of r components, say $j_{\ell_1}, \ldots, j_{\ell_r}$, where $\{j_{\ell_1}, \ldots, j_{\ell_r}\} = \bigcup_{k \le \hat{s}} B_k \subseteq \{j_1, \ldots, j_q\}$,
on which the final classifier is based. The $O(r^2)$ figure is derived as follows:
Constructing the classifier from s batches takes $O(sb)$ time, so the total time needed is
$O(\sum_{1 \le s \le \hat{s}} s b) = O(\hat{s}^2 b) = O(r^2)$. Here, since b is generally of order n (see section #)
and $\hat{s}^2 b = O(r^2 b)$, we have replaced $r^2 b$ by $r^2$ since, in the problems we are treating, n
is generally so much less than r that it can be treated as fixed. Provided the number
of reruns in stage (4) is only $O(1)$, the order of magnitude of the time taken to run the
algorithm to completion is $O(p \log p + r^2)$.

2.5. Discussion. The procedures above are intended to reflect methodologies already 
used in practice. Our main aim is to show that such techniques can be used to address 
not just contemporary problems where there is believed to be considerable sparsity 
and only a small number of significant features (for example, five or ten genes out of 
thousands or tens of thousands), but also reduced sparsity and a larger total number 
of features (for instance, thousands or tens of thousands of DNA sequences out of tens 
or hundreds of thousands of possibilities). Additionally we show that the methods con- 
tinue to work well under minimal distributional assumptions (for example, normality 
is not needed), and minimal conditions about the correlation structure among features. 
In all these senses, procedures such as those described above are particularly versatile. 

3 PROPERTIES OF FEATURE RANKING 

3.1. Main result on ranking. Let $\pi = E(I_i)$ denote the proportion of data that come
from population $\Pi_1$. We assume below that $0 < \pi < 1$, and that when the training
data are drawn randomly from the union of $\Pi_0$ and $\Pi_1$, the prior probability that any
given datum $X_i$ is from $\Pi_1$ equals $\pi$. Therefore the corresponding probability for $\Pi_0$
is $1 - \pi$. We take n, representing the total size of the training sample, to be the key
asymptotic parameter, and interpret the dimension, p, of $X_i$ as a function of n.

Next we describe our model. Write $X_i = (X_{i1}, \ldots, X_{ip})$, and, when $X_i$ comes from
$\Pi_1$, take

    $X_{ij} = \mu_j + Z_{ij}$ ,   (3.1)

where $\mu_j \ge 0$ are constants and, for simplicity, we assume that the variables $Z_{ij}$ are
identically distributed as Z, say. (This condition can be relaxed; see (3.5) below.) Take
$X_{ij} = Z_{ij}$ when $X_i$ is drawn from $\Pi_0$. The vectors $(Z_{i1}, \ldots, Z_{ip})$ and variables $I_i$, for
$1 \le i \le n$, are assumed to be totally independent. We allow the $\mu_j$s to be functions
of n.

Write $\hat\alpha_j$ and $\hat\beta_j$ for the values of $\alpha$ and $\beta$ that jointly minimise $\ell_j(\alpha, \beta)$ within
radius $n^{-c}$ of $(\alpha_0, 0)$, where $\alpha_0 = \log\{\pi/(1 - \pi)\}$, for any given $c \in (0, \frac{1}{2})$. (A separate
argument can be used to prove that if $\hat\alpha_j$ and $\hat\beta_j$ are chosen without constraint then,
under the conditions of Theorem 1 below, they satisfy $\hat\alpha_j = \alpha_0 + O_p(n^{-c})$ and $\hat\beta_j =
O_p(n^{-c})$ uniformly in $1 \le j \le p$, for some $c \in (0, \frac{1}{2})$.) Define

    $S_j = \frac{1}{n \pi (1 - \pi)} \sum_{i=1}^{n} (I_i - \pi) Z_{ij}$   (3.2)

and note that $E(S_j) = 0$. Put $\lambda = (n^{-1} \log n)^{1/2}$.

Theorem 1. Assume that for each n, $|\mu_j| \le \mathrm{const.}\, \lambda$ for $1 \le j \le p$; that $p = p(n) \to \infty$
and, for constants $B_1 > 0$ and $B_2 > 2 \max(B_1 + 3, 2 B_1)$, $p = O(n^{B_1})$ and $0 < E|Z|^{B_2} <
\infty$; and that $E(Z) = 0$. Then, uniformly in $1 \le j \le p$,

    $\mathcal{L}_j = \ell_j(\hat\alpha_j, \hat\beta_j) = R - \frac{1}{2} \pi (3 - 2\pi) (E Z^2)^{-1} (S_j + \mu_j)^2 + O_p(\lambda^3)$ ,   (3.3)

where the random variable R = R(n) does not depend on j. More particularly, the
$O_p(\lambda^3)$ term in (3.3) can be written as $Q_j \lambda^3$, where, for a constant $B_3$ depending on
$B_1$ and $B_2$, and with $B_4 = \frac{1}{2} \{B_2 - 2 \max(B_1 + 3, 2 B_1)\} - \epsilon$ for any $\epsilon > 0$, the random
variables $Q_j$, for $1 \le j \le p$, satisfy:

    $\sum_{j=1}^{p} P(|Q_j| > B_3) = O(n^{-B_4})$   (3.4)

as $n \to \infty$.

The statistic $S_j$ is, up to normalisation, the well-known z-score statistic for testing
whether the jth feature is significant or not; see for example Donoho and Jin (2008,
2009) and Jin (2009). In (3.3), since the first term, R, does not depend on j then the
second term is the one that reflects the strengths of individual features. As a result,
ranking features according to $\mathcal{L}_j$ gives close, but not necessarily the same, results as
ranking features according to $|S_j + \mu_j|$.

The assumption that each $Z_{ij}$ has the same distribution is made to simplify
discussion and notation, and is readily relaxed. For example, it suffices to assume that
each $Z_{ij}$, for $1 \le i \le n$, is distributed as $Z_j$, say, where, instead of the assumptions
imposed on Z in the theorem, we ask that:

    $E(Z_j) = 0$, $E(Z_j^2)$ is bounded away from zero and infinity uniformly
    in j, and $P(|Z_j| > x) \le P(|Z| > x)$ for all $x > 0$ and some random   (3.5)
    variable Z, where $E|Z|^{B_2} < \infty$.

In this case the moment $E(Z^2)$ in (3.3) would be replaced by $E(Z_j^2)$. The conclusions
that we draw, below, from Theorem 1 are unchanged, provided we interpret $\mu_j$ as
$\mu_j / (E Z_j^2)^{1/2}$ during discussion.

Although we ask that the vectors $X_i$ be independent, we make no assumption about
the relationships among their components. For example, the values of $Z_{i1}, \ldots, Z_{ip}$ can
be highly correlated (indeed, in an extreme case, equal to one another) or completely
independent. The latter instance is actually the most difficult, in terms of rigorously
establishing that (3.3) and (3.4) hold. At the other end of the spectrum, the case
where $Z_{i1} = \ldots = Z_{ip}$ with probability 1 is trivial, since there effectively only a single
component index, with different candidate values for the mean, has to be treated.

3.2. Expected number of misrankings. Assume that some of the $\mu_j$s are zero and all
the others are strictly positive. Ideally, we would like the criterion $\mathcal{L}_j$ to be a good
indicator of the positivity of $\mu_j$, in particular to take a lesser (or larger negative) value
if $\mu_j$ is positive than it does when $\mu_j = 0$. Reflecting this aspiration, if there exist
component indices $j_1$ and $j_2$ such that $\mu_{j_1} > 0$ and $\mu_{j_2} = 0$, but $\mathcal{L}_{j_2} < \mathcal{L}_{j_1}$, then we shall
say that a misranking has occurred. The expected total number of misrankings,

    $\nu_{\mathrm{misrank}} = \sum_{j_1 : \mu_{j_1} > 0} \; \sum_{j_2 : \mu_{j_2} = 0} P(\mathcal{L}_{j_2} < \mathcal{L}_{j_1})$ ,

is a measure of the performance of $\mathcal{L}_j$ as a criterion for distinguishing between positive
and zero values of $\mu_j$; lower values of $\nu_{\mathrm{misrank}}$ correspond to higher performance.

Since the random variable $S_j$ in (3.2) and (3.3) has standard deviation of size $n^{-1/2}$
then, if the positive $\mu_j$s are of smaller order than $n^{-1/2}$, with probability converging
to $\frac{1}{2}$ any attempt to rank any pair of means $\mu_j$ using the values of $\mathcal{L}_j$ will produce
the wrong result about half the time. The following theorem makes this clear. Let $p_1$
denote the number of indices j for which $\mu_j > 0$, and put

    $\sigma^2_{j_1 j_2, \pm} = \pi (1 - \pi) \, \mathrm{var}(Z_{1 j_1} \pm Z_{1 j_2})$   (3.6)

for respective choices of the plus and minus signs.

Theorem 2. Assume the conditions of Theorem 1, that $\sup_{j \le p} \mu_j = o(n^{-1/2})$, and
that $\sigma^2_{j_1 j_2, \pm}$ is bounded away from zero uniformly in $j_1$, $j_2$ and choices of the $\pm$ signs.
Then $P(\mathcal{L}_{j_2} < \mathcal{L}_{j_1}) \to \frac{1}{2}$ uniformly in $1 \le j_1, j_2 \le p$, in particular uniformly in pairs
$j_1, j_2$ such that $\mu_{j_1} > 0$ and $\mu_{j_2} = 0$. Moreover, $\nu_{\mathrm{misrank}} = \frac{1}{2} p_1 (p - p_1) + o\{p_1 (p - p_1)\}$
as $n \to \infty$.
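
The misranking count itself is a simple pairwise comparison, and the limiting behaviour in Theorem 2 can be imitated numerically. The sketch below is an illustration only: the function name `count_misrankings` is hypothetical, and the second half uses pure-noise criterion values as a stand-in for a criterion that carries no information about which means are positive, rather than the logit criterion itself.

```python
import numpy as np

def count_misrankings(criterion, mu):
    """Count pairs (j1, j2) with mu[j1] > 0, mu[j2] == 0 and criterion[j2] < criterion[j1]."""
    criterion = np.asarray(criterion, dtype=float)
    mu = np.asarray(mu, dtype=float)
    signal = criterion[mu > 0]     # criterion values where the mean is positive
    null = criterion[mu == 0]      # criterion values where the mean is zero
    return int(np.sum(null[None, :] < signal[:, None]))

# Small worked example: only the pair (index 1, index 2) is misranked.
print(count_misrankings([0.1, 0.5, 0.3, 0.7], [1.0, 1.0, 0.0, 0.0]))

# Uninformative criterion: roughly half of the p1 * (p - p1) pairs are misranked.
rng = np.random.default_rng(0)
p, p1 = 200, 50
mu = np.concatenate([np.ones(p1), np.zeros(p - p1)])
noise_criterion = rng.standard_normal(p)
fraction = count_misrankings(noise_criterion, mu) / (p1 * (p - p1))
print(round(fraction, 3))
```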

Likewise, if the positive $\mu_j$s are of size $n^{-1/2}$ then the probability of incorrectly
ranking the $j_1$th component lower than the $j_2$th component, even though $\mu_{j_1} > 0$ and
$\mu_{j_2} = 0$, does not converge to zero. The next theorem quantifies this property. There
we define $\Phi$ to be the standard normal distribution function.

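Theorem 3. Assume the conditions of Theorem 1, and that each nonzero $\mu_j$ equals
$c n^{-1/2}$ where c > 0. Then

    $P(\mathcal{L}_{j_2} < \mathcal{L}_{j_1}) = \Phi(-c/\sigma_{j_1 j_2, +}) \, \Phi(c/\sigma_{j_1 j_2, -}) + \Phi(c/\sigma_{j_1 j_2, +}) \, \Phi(-c/\sigma_{j_1 j_2, -}) + o(1)$ ,

uniformly in $j_1, j_2$ such that $\mu_{j_1} > 0$ and $\mu_{j_2} = 0$. Furthermore,

    $\nu_{\mathrm{misrank}} = \sum_{j_1 : \mu_{j_1} > 0} \; \sum_{j_2 : \mu_{j_2} = 0} \{ \Phi(-c/\sigma_{j_1 j_2, +}) \, \Phi(c/\sigma_{j_1 j_2, -})
        + \Phi(c/\sigma_{j_1 j_2, +}) \, \Phi(-c/\sigma_{j_1 j_2, -}) \} + o\{p_1 (p - p_1)\}$

as $n \to \infty$.

Here, by default, $\Phi(-\infty) = 0$ and $\Phi(\infty) = 1$. This is relevant when $\sigma_{j_1 j_2, \pm} = 0$.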

If the number of components where the mean is positive is large, for example if
it equals a non-negligible proportion of the total number, p, of components, then the
number of misrankings can generally not be reduced to low levels unless we take the
nonzero means to be a little larger than $n^{-1/2}$ in order of magnitude terms. It is enough
to take the positive mean to be a logarithmic factor larger; specifically, the mean should
equal $c\lambda$ where, as before, c > 0 and $\lambda = (n^{-1} \log n)^{1/2}$. Theorem 4, below, shows
that in this case the expected number of misrankings can be reduced to a quantity of
smaller order than p, or even to a number that converges to zero polynomially fast,
depending on how large we choose c.

As a prelude to stating Theorem 4 we introduce an assumption which asks that
the random variables $Z_j$ satisfy a pairwise Cramér continuity condition. Here it is
convenient to assume that there exists an infinite stochastic process $Z_1, Z_2, \ldots$ such
that:

    (a) Each $Z_j$ has the distribution of Z (in this sense the process
    $Z_1, Z_2, \ldots$ is weakly stationary), (b) if $X_i = (X_{i1}, \ldots, X_{ip})$ is
    drawn from $\Pi_0$ then $X_{i1}, \ldots, X_{ip}$ has the same joint distribution as
    $Z_1, \ldots, Z_p$, and (c) if $X_i$ is drawn from $\Pi_1$ then $X_{i1}, \ldots, X_{ip}$ has the   (3.7)
    same joint distribution as $Z_1 + \mu_1, \ldots, Z_p + \mu_p$, where the $\mu_j$s are the
    nonnegative constants introduced prior to Theorem 1.

The Cramér continuity condition we impose is the following:

    $\limsup_{t \to \infty} \; \sup_{|t_1| + |t_2| > t} \; \sup_{1 \le j_1 < j_2 < \infty} |E\{\exp(i t_1 Z_{j_1} + i t_2 Z_{j_2})\}| < 1$ ,   (3.8)

where on this occasion $i = \sqrt{-1}$. For example, (3.8) would hold if the process $Z_j$
were strictly stationary and each pair $(Z_{j_1}, Z_{j_2})$ had a joint density $f_{j_1 j_2}$ that satisfied
$\sup_{j_1, j_2} \iint |f'_{j_1 j_2}| < \infty$, where $f'_{j_1 j_2}(x_1, x_2) = (\partial^2 / \partial x_1 \partial x_2) f_{j_1 j_2}(x_1, x_2)$. It would also
hold if the variables $Z_j$ were independent with a common nonsingular distribution.

Recall that $p_1$ equals the number of indices j such that $\mu_j > 0$, and that $B_4 =
B_4(\epsilon) = \frac{1}{2} \{B_2 - 2 \max(B_1 + 3, 2 B_1)\} - \epsilon$ where $\epsilon > 0$. Define $\sigma_{j_1 j_2, \pm}$ by (3.6) and put
$\kappa_n = (\log n)^{1/2}$.

Theorem 4. Assume the conditions of Theorem 1, that (3.7) and (3.8) hold, and
that $B_2$, in the moment condition $E|Z|^{B_2} < \infty$, is so large that for some $\epsilon > 0$,
$p_1 (p - p_1) = o(n^{B_4})$. Take each nonzero $\mu_j$ to equal $c (n^{-1} \log n)^{1/2}$, where c > 0. Then

    $\nu_{\mathrm{misrank}} = \{1 + o(1)\} \sum_{j_1 : \mu_{j_1} > 0} \; \sum_{j_2 : \mu_{j_2} = 0} \{ \Phi(-c \kappa_n / \sigma_{j_1 j_2, +})
        + \Phi(-c \kappa_n / \sigma_{j_1 j_2, -}) \} + o(1)$   (3.9)

as $n \to \infty$.

Elucidation of (3.9) requires information about the covariance of the process Zj, 
in (3.7). For simplicity let us assume that the variables Zj are uncorrelated. Then by 
(3.6), $\sigma^2_{j_1 j_2, \pm} = 2 \pi (1 - \pi) E(Z^2) = (2 c_0)^{-1}$, say, for either choice of the $\pm$ signs. Hence
(3.9) implies that

    $\nu_{\mathrm{misrank}} = \{1 + o(1)\} \, p_1 (p - p_1) \, \{2 \pi c \, (2 c_0 \log n)^{1/2}\}^{-1} \, n^{-c_0 c^2} + o(1)$ .

Therefore, if c is chosen so large that $p_1 (p - p_1) (\log n)^{-1/2} n^{-c_0 c^2} \to 0$ then the expected
number of misranks will converge to zero. For smaller positive values of c the expected
number will be of smaller order than the potential number of misranks, $p_1 (p - p_1)$, but
it will not necessarily be negligible itself.

The results in Theorems 2-4 have benefited from a simplification afforded by the
assumption that the variables $Z_{ij}$ all have the same distribution, and in particular
have the same variance. As noted below Theorem 1, that condition can be relaxed
and the assumption (3.5) imposed instead. In practice, however, one could
standardise, in a componentwise fashion, the values of $X_{ij}$ for scale, and in that case it is
possible to state versions of Theorems 2-4 in settings where $E(Z_{ij}^2)$ varies with j.
The model that we have been using, i.e. $X_{ij} = Z_{ij} + c (n^{-1} \log n)^{1/2} I_i$ where the $Z_{ij}$s
are independent, is (for moderate n) a good approximation to the standardised form
$X'_{ij} = Z'_{ij} + c (n^{-1} \log n)^{1/2} I_i$, where the $Z'_{ij}$s satisfy (3.5). Detailed arguments here
are similar to those given by Hall and Wang (2008).

3.3. Effects of dependence of the process $Z_j$ on interpretations of (3.3). The expected
value of the number of misrankings, which we treated in section 3.2, is not as much
affected by dependence among components of the $Z_j$ process as are other aspects
of the distribution of the number of misrankings. For example, if the $Z_j$s (in the
stochastic process $Z_1, Z_2, \ldots$ introduced in (3.7)) are all independent then the quantities
$S_j$, defined at (3.2) and on which the values of $\mathcal{L}_j$ predominantly depend (see (3.3)),
are also independent, and so decisions based on the respective values of $\mathcal{L}_j$ are made
virtually independently of one another. In this case the variance of the total number
of misrankings is relatively low. However, if the $Z_j$s are highly correlated then the
variance can be higher, although it depends on how the positive means are distributed
among the components of $X_i$. In the present section we briefly discuss these issues.

Let the process $Z_1, Z_2, \ldots$ in (3.7) be $\zeta$-dependent, meaning that any subsequence
$Z_{j_1}, \ldots, Z_{j_k}$ such that $j_{\ell+1} - j_\ell > \zeta$, for each $\ell$, is comprised entirely of independent
random variables. We permit $\zeta$ to diverge with n, and we suppose that $p_1$, the number
of nonzero values of $\mu_j$, can also increase with n and that $\limsup_{n \to \infty} p_1/p < 1$. One
approach to arranging the nonzero means is to distribute them randomly, for example
taking $\mu_j = J_j \mu$ for $1 \le j \le p$, where $J_1, \ldots, J_p$ is a random permutation of $p_1$ ones
and $p - p_1$ zeros and is independent of the $Z_j$s in (3.7). In this setting the clustering
that arises through dependence is often negligible, even if the dependence in the process
$Z_1, \ldots, Z_p$ is quite strong.

To appreciate why, note that the expected number of nonzero means in each string
of $\zeta$ consecutive components of $X_i$, when $X_i$ is drawn from $\Pi_1$, equals $\zeta \times p^{-1} p_1$, and
that if $(\zeta p_1)^2 = o(p)$ then the probability that none of the approximately $p/\zeta$ strings
(placed end to end) of $\zeta$ consecutive components that contain one or more nonzero
means are adjacent, and the probability that none of the strings contains more than
one component, both converge to zero. Therefore, in view of the assumption of
$\zeta$-dependence, if we treat the $Z_j$s as independent and identically distributed when making
a statement about properties of rankings deduced from (3.3), the probability that we
commit an error in the statement converges to zero as $n \to \infty$. It can then be deduced
that, in cases where the positive means are randomly distributed and $(\zeta p_1)^2 = o(p)$,
the variance of the number of misrankings is relatively low.

Alternatively, rather than scatter the nonzero means $\mu_j$ randomly throughout the
vector $(Z_1, \ldots, Z_p)$, we could place them all down one end. This makes the distribution
of those quantities just about as "clumpy" as possible, by exploiting the $\zeta$-dependence
property. For example, if $p_1 \le \zeta$ then all of the nonzero means are attributed to the first
$p_1$ variables in the sequence $X_{i1}, \ldots, X_{ip}$, when $X_i$ is drawn from population $\Pi_1$. The
assumption of $\zeta$-dependence permits $X_{i1} = \ldots = X_{i p_1}$ with probability 1, whenever
$X_i$ comes from either $\Pi_0$ or $\Pi_1$. This reduces the amount of available information,
since all the components that contain information for discriminating between $\Pi_0$ and
$\Pi_1$ are simply copies of one another; there are no independent sources of corroborating
information. Moreover, the values of $S_j$ in (3.3) are identical for $1 \le j \le p_1$, and so
the values of $\mathcal{L}_j$ are the same too, up to remainders of order $\lambda^3$, which implies that the
total number of misranks is approximately equal to $p_1$ times the number of times that
a specific component with a positive mean is misranked.

Of course, this increases the variance of the number of misranks. The setting $p_1 \le \zeta$
can encompass instances where $(\zeta p_1)^2 = o(p)$, which was shown two paragraphs above
to result in a relatively high amount of information about the differences between $\Pi_0$
and $\Pi_1$ when the nonzero means are scattered randomly in the data vector. These
examples illustrate the more general rule that, in cases where the positive means are
distributed consecutively in relatively long-range dependent vectors $X_i$, the variance
of the number of misranks tends to be higher than in cases where those means are
distributed at random.

We conclude this section by comparing the notion of misranking with that of 
(feature) False Discovery Rate (FDR); for the latter, see for example Benjamini and 
Hochberg (1995) and Abramovich et al. (2006). For any set of selected features, FDR 
equals the fraction of falsely selected features. This concept and that of misranking 
both provide informative measures of how well important features are ranked, but they 
are nevertheless different in important respects. To appreciate why, let us focus on a 
specific feature. If we adhere to the notion of FDR then all that matters is whether 
the feature is selected or not. If instead we use the concept of misranking then
the order or rank of the feature being selected also matters. Technically there are also 
important differences. For instance, misrankings are defined quite simply in terms of 
pairwise comparisons of individual features, while FDR can involve higher-order rela- 
tionships among different features. If we consider the influence, on these measures, of 
the correlation structure among features, then misranking depends only on pairwise 
correlations, but FDR may depend on high-order correlations. As a result, FDR can 
be significantly more difficult to characterise than misranking, and requires much more 
heavily constrained assumptions about dependence than are necessary using the mis- 
ranking measure. Therefore, since the central problem is how well important features 
are ranked, it is more appropriate to assess performance here using misranking, rather
than FDR.

The number of misrankings bears a close relationship to both the Wilcoxon
rank-sum test and the area under curve (AUC) of the ROC plot (see Hanley and McNeil,
1982). In this case the ROC is constructed with respect to whether each variable is
correctly classified as having nonzero mean or not. In fact, it is possible to show that

    $\mathrm{AUC} = 1 - \{p_1 (p - p_1)\}^{-1} \, (\#\ \mathrm{misrankings})$ .   (3.10)

Thus we may interpret properties of $\nu_{\mathrm{misrank}}$ in Theorems 2-4 in the context of AUC.
For instance, under the assumptions of Theorem 2 the AUC score decays to 0.5, this
being the score for the random guessing model.

4 THRESHOLDING FOR ADAPTIVE DIMENSION REDUCTION

Recall that in Theorem 1 we showed that $\mathcal{L}_j$ equals $-U_{j1}$, where

    $U_{j1} = \frac{1}{2} \pi (3 - 2\pi) (E Z^2)^{-1} (S_j + \mu_j)^2$ ,   (4.1)

plus a quantity that does not depend on j, plus a remainder term that is uniformly
smaller in size. The feature-ranking step (stage (1) in the algorithm in section 2.1)
aspires to re-order the indices j such that the indices for which $\mu_j \ne 0$ are ranked
first, and those for which $\mu_j = 0$ are listed together at the end of the sequence. If this
objective is largely achieved then the main task that remains is to choose the point in
the ranking where the change occurs; this is stage (3) of the algorithm in section 2.1.
In the present section we explore method (3a), based on thresholding; see section 2.3.

Observe that if $U_{j1}$ is as at (4.1) then $U_{j1} = U_{j2} + U_{j3}$, where

    $U_{j2} = \frac{1}{2} \pi (3 - 2\pi) (E Z^2)^{-1} S_j^2$ ,   $U_{j3} = \frac{1}{2} \pi (3 - 2\pi) (E Z^2)^{-1} (\mu_j^2 + 2 \mu_j S_j)$ .   (4.2)

If we can construct a good approximation, $\hat{U}_{j2}$ say, to $U_{j2}$ then we can subtract it from
$\mathcal{L}_j$, leaving only $U_{j3}$ plus a small remainder. The value of $U_{j3}$ is exactly zero if $\mu_j = 0$,
and is strictly positive with high probability if $\mu_j > 0$. The quantity $\tilde{\mathcal{L}}_j = -\hat{U}_{j2}$ is
referred to in that notation in method (3a) in section 2.3. If we choose an appropriate
threshold, t say, then we can implement (3a) as follows:

    define $r \in [k_0 + 1, q]$, a random variable, to be the least integer in that   (4.3)
    range such that $\mathcal{L}_{j_{r+k}} - \tilde{\mathcal{L}}_{j_{r+k}} > t$ for $1 \le k \le k_0$,

where $k_0 > 0$ is a fixed integer. Then, subject to the jiggling step in stage (4) of the
algorithm, we determine that the features with indices $j_1, \ldots, j_r$ are the ones that have
greatest influence on whether a data value $X_i$ came from $\Pi_0$ or $\Pi_1$.

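To define $\hat U_{j2}$, put $\hat\pi = n^{-1}\sum_i I_i$ and $\bar X_j = n^{-1}\sum_i X_{ij}$, define our estimator of
$\tau^2 = E(Z^2)$ by $\hat\tau^2 = (np)^{-1}\sum_i\sum_j (X_{ij} - \bar X_j)^2$, and let
$\hat S_j = \{n\,\hat\pi\,(1-\hat\pi)\}^{-1}\sum_i (I_i - \hat\pi)\,(X_{ij} - \bar X_j)$; compare the definition of $S_j$ at (3.2).
Then, motivated by the definition of $U_{j2}$ at (4.2), put

$$-\tilde\ell_j = \hat U_{j2} = \tfrac12\,\hat\pi\,(3 - 2\hat\pi)\,\hat\tau^{-2}\,\hat S_j^2\,.$$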

Theorem 5, below, shows in effect that this is a good approximation to $U_{j2}$. Let $B_4$ be
as in Theorem 1.

Theorem 5. Under the conditions of Theorem 1 we can write

$$\hat\ell_j - \tilde\ell_j = R - U_{j3} + \Theta_j\,\lambda^3\,, \tag{4.4}$$

where $R$ is as in (3.3) and, for a constant $B > 0$, the random variables $\Theta_j$ satisfy

$$\sum_{j=1}^p P(|\Theta_j| > B) = O(n^{-B_4})\,. \tag{4.5}$$
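The plug-in quantities defined just before Theorem 5 are simple moment calculations, and a short sketch may make them concrete. The code below is illustrative only: it assumes the data are held as a list of rows, that $\hat\tau^2$ is the pooled average of squared deviations from the column means, and that $\hat U_{j2} = \tfrac12\hat\pi(3-2\hat\pi)\hat\tau^{-2}\hat S_j^2$. The function name `plug_in_scores` is hypothetical.

```python
def plug_in_scores(X, I):
    """Return (pi_hat, tau2_hat, S_hat, U2_hat) for an n x p data matrix X
    (a list of rows) and a list I of 0/1 class indicators."""
    n, p = len(X), len(X[0])
    pi_hat = sum(I) / n
    xbar = [sum(row[j] for row in X) / n for j in range(p)]
    # Pooled estimator of tau^2 = E(Z^2); this exact form is an assumption.
    tau2_hat = sum((X[i][j] - xbar[j]) ** 2
                   for i in range(n) for j in range(p)) / (n * p)
    denom = n * pi_hat * (1.0 - pi_hat)
    S_hat = [sum((I[i] - pi_hat) * (X[i][j] - xbar[j]) for i in range(n)) / denom
             for j in range(p)]
    const = 0.5 * pi_hat * (3.0 - 2.0 * pi_hat) / tau2_hat
    U2_hat = [const * s * s for s in S_hat]
    return pi_hat, tau2_hat, S_hat, U2_hat


# Tiny worked example: two observations per class, two features.
X = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
I = [1, 1, 0, 0]
pi_hat, tau2_hat, S_hat, U2_hat = plug_in_scores(X, I)
```

With this definition $\hat S_j$ reduces algebraically to the difference between the two class means of feature $j$, which is why the example gives $\hat S_1 = 2$ and $\hat S_2 = -3$.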
Finally we describe implementation of method (3a) in stage (3) in section 2.3.
We shall assume we are in the context of Theorem 4, where the nonzero $\mu_j$s equal
$c\,(n^{-1}\log n)^{1/2}$. Here it is appropriate to take the threshold $t = t(n)$ to be a sequence
of negative numbers such that $|t|$ is of strictly larger order than $\lambda^3$ and of strictly
smaller order than $\lambda^2$; that is, such that

$$|t| = o(\lambda^2) \quad\text{and}\quad \lambda^3 = o(|t|)\,. \tag{4.6}$$

Let $k_1$ denote the number of nonzero means added to the components of $X_i$ when
$X_i$ is drawn from $\Pi_1$. To simplify discussion we assume that $k_1$ is nonrandom, although
of course it may depend on $n$. Below Theorem 4 we discussed a case where the expected
number of misrankings converged to zero, and hence the probability of a misranking
occurring also tended to zero. In this setting,

with probability converging to 1, $\mu_{\hat j_k} > 0$ for $k \le k_1$ and $\mu_{\hat j_k} = 0$ for
$k > k_1$, (4.7)

and it can be shown that if $t < 0$ satisfies (4.6) then the definition of $r$ at (4.3) produces
a random variable which, with probability converging to 1 as $n \to \infty$, equals $k_0 + k_1$.
Therefore, taking $k_0 = 0$ in the rule at (4.3) ensures that, with probability converging
to 1, $r$ is exactly equal to $k_1$. Cases where the nonzero means are of size $c\,(n^{-1}\log n)^{1/2}$,
but $c$ is not sufficiently large to ensure that (4.7) holds, can be treated satisfactorily
by choosing $t < 0$ to satisfy (4.6) but taking $k_0 \ge 1$. Depending on the strength of
correlation between components of the vector $X$ it is possible to choose a fixed $k_0$ such
that, with probability converging to 1, $r$ is within $C\rho k_1$ of $k_1$, where $C > 0$ and $\rho$
equals the expected proportion of feature indices that have positive means when $X_i$ is
drawn from $\Pi_1$, but are incorrectly ranked at a low level.



5 NUMERICAL PROPERTIES 

5.1. Stability of misranking totals. Here we simulate under the model at (3.1), where
the signals are represented by $\mu_j$ and the noise by $Z_{ij}$. If we measure performance in
terms of the total number of misrankings, or equivalently in terms of AUC (see (3.10)),
and if the $\mu_j$s decrease at rate $n^{-1/2}$, then Theorem 3 implies that performance should
be stable as a function of sample size, $n$. That is, it should depend very little on $n$. To
explore this property numerically we consider the cases $n = 20$, 50, 100 and 200, with
$p = 0.4 \times n^2$. We take 90% of the $\mu_j$s to equal zero and the others to equal $c_0\,(20/n)^{1/2}$,
where $c_0 = 0.4, 0.8, \ldots, 1.6$. The noise variables $Z_{ij}$ are independent and identically
distributed as N(0, 1).

Figure 1 shows how the expected value of AUC varies with $n$. The dashed lines on
either side of each curve are 95% pointwise confidence bands for the AUC estimate, and
quantify the uncertainty of the simulation study. The key feature is that, as predicted
by Theorem 3, AUC changes very little with $n$, even when $n$ is small. As expected,
and as predicted by Theorems 2-4, AUC increases with increasing $c_0$.
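A small-scale version of this experiment can be sketched as follows. This is not the code used for Figure 1. It assumes that the model at (3.1) has the form $X_{ij} = I_i\mu_j + Z_{ij}$, that the two classes are of equal size, and that features are ranked by the absolute difference of class means as a leading-order stand-in for the likelihood score. The names `auc_from_scores` and `simulate_auc` are hypothetical.

```python
import math
import random


def auc_from_scores(signal_scores, null_scores):
    """AUC = 1 - (# misrankings) / {p1 (p - p1)}; a misranking is a null
    feature that scores strictly higher than a signal feature."""
    misrankings = sum(1 for s in signal_scores for z in null_scores if z > s)
    return 1.0 - misrankings / (len(signal_scores) * len(null_scores))


def simulate_auc(n, c0, seed=0):
    """One simulated AUC with p = 0.4 n^2 features, 10% of which carry the
    signal mu = c0 * sqrt(20 / n); noise is independent N(0, 1)."""
    rng = random.Random(seed)
    p = int(round(0.4 * n * n))
    p1 = p // 10                      # number of features with nonzero mean
    mu = c0 * math.sqrt(20.0 / n)
    n1 = n // 2                       # assumed: balanced classes
    n0 = n - n1
    scores = []
    for j in range(p):
        shift = mu if j < p1 else 0.0
        mean1 = sum(shift + rng.gauss(0.0, 1.0) for _ in range(n1)) / n1
        mean0 = sum(rng.gauss(0.0, 1.0) for _ in range(n0)) / n0
        scores.append(abs(mean1 - mean0))   # stand-in ranking score
    return auc_from_scores(scores[:p1], scores[p1:])


for n in (20, 50):
    print(n, round(simulate_auc(n, c0=1.2), 3))
```

Because the signal shrinks like $n^{-1/2}$ while the standard error of a class mean shrinks at the same rate, the signal-to-noise ratio of each score, and hence the AUC, should change little between the two values of $n$.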



[Plot: vertical axis from 0.0 to 1.0; horizontal axis n = 20, 50, 100, 200.]

Figure 1: AUC scores in simulation study with $\mu_j = c_0\,(20/n)^{1/2}$, in the example of
section 5.1.



We also explored cases where the nonzero $\mu_j$s took random values, in particular
where they were drawn randomly and uniformly from the interval $[0, c_0\,(20/n)^{1/2}]$. This
case is more challenging, since the genuine signals are now strictly smaller than in the
previous situation. Therefore it comes as no surprise to learn that the AUC levels for
each $c_0$ are reduced. However, the overall pattern of stability with respect to $n$ is still
evident, with very slightly more variation than in the case of fixed $\mu_j$s.

5.2. Influence of correlation on misranking performance. Next we discuss the effects of
correlated noise $Z_{ij}$ in the model at (3.1). We take the noise to be a first-order
autoregression, i.e. $Z_{ij} = \rho\,Z_{i,j-1} + (1 - \rho^2)^{1/2}\,\epsilon_{ij}$, where the $\epsilon_{ij}$s are independent and normal
N(0, 1). Thus, $Z_{ij}$ and $Z_{ik}$ are correlated for all pairs $(j,k)$, with the coefficient of
correlation decaying exponentially fast in $|j - k|$. The value of $c_0$ is fixed at 1.2, and
nonzero $\mu_j$s are chosen uniformly in $[0, c_0\,(20/n)^{1/2}]$. The values of $n$, $p$ and the number
of true signals are as in section 5.1.
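The correlated noise described here is straightforward to generate recursively. The sketch below is an illustration under one assumption not stated in the text, namely that the first component of each row is drawn from N(0, 1) so that the sequence is stationary with unit variance. The helper names `ar1_row` and `pooled_lag_correlation` are hypothetical.

```python
import math
import random


def ar1_row(p, rho, rng):
    """One noise vector with Z_j = rho * Z_{j-1} + sqrt(1 - rho^2) * eps_j."""
    z = [rng.gauss(0.0, 1.0)]            # stationary start (an assumption)
    scale = math.sqrt(1.0 - rho * rho)
    for _ in range(p - 1):
        z.append(rho * z[-1] + scale * rng.gauss(0.0, 1.0))
    return z


def pooled_lag_correlation(rows, lag):
    """Average of Z_j * Z_{j+lag} over rows and j; this estimates the
    correlation because each Z_j has mean 0 and variance 1 by construction."""
    total, count = 0.0, 0
    for z in rows:
        for j in range(len(z) - lag):
            total += z[j] * z[j + lag]
            count += 1
    return total / count


rng = random.Random(1)
rows = [ar1_row(200, 0.5, rng) for _ in range(200)]
lag1 = pooled_lag_correlation(rows, 1)   # should be close to rho = 0.5
lag3 = pooled_lag_correlation(rows, 3)   # should be close to rho**3 = 0.125
```

The empirical correlations at lags 1 and 3 should be near $\rho$ and $\rho^3$, illustrating the exponential decay in $|j - k|$ mentioned above.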
Table 1 gives values of Monte Carlo approximations to the means and standard
deviations of AUC scores when the variables, or features, for which $\mu_j$ is nonzero are
grouped together among the lowest values of $j$, as in the discussion following Theorem 4.
The main observation is that while mean AUC remains stable across the table, the
variability of AUC is much greater when strong correlation exists. For instance, if
$n = 200$ then when $\rho = 0.99$ the standard deviation of AUC scores is ten times larger
than when $\rho = 0$. The presence of correlation makes the problem significantly more
difficult; it effectively inserts an element of randomness into the process of correctly
ranking important features.

Table 2 shows results in the same setting, except that the indices of variables
where $\mu_j$ is nonzero are distributed randomly between 1 and $p$. There is again a
high degree of stability, but variability is comparatively less than that in Table 1,
consistent with the discussion in section 3.3. In particular, by randomly distributing
the indices of the nonzero $\mu_j$s we effectively reduce dependence among the variables
that are important, and so, reflecting the results in Table 1, the problem becomes less
statistically challenging.







                    AUC means                          AUC std dev.
   ρ      n = 20  n = 50  n = 100  n = 200     n = 20  n = 50  n = 100  n = 200
 -0.99    0.725   0.698   0.706    0.709       0.136   0.078   0.039    0.018
 -0.75    0.704   0.717   0.715    0.712       0.064   0.022   0.012    0.006
 -0.50    0.650   0.705   0.718    0.711       0.070   0.024   0.011    0.006
 -0.25    0.702   0.725   0.706    0.716       0.062   0.025   0.012    0.007
  0.00    0.699   0.707   0.711    0.714       0.070   0.027   0.012    0.006
  0.25    0.687   0.714   0.707    0.712       0.080   0.032   0.015    0.007
  0.50    0.666   0.682   0.713    0.713       0.094   0.037   0.017    0.009
  0.75    0.715   0.710   0.718    0.719       0.101   0.047   0.025    0.013
  0.99    0.662   0.725   0.708    0.704       0.259   0.157   0.107    0.065

Table 1: Mean and standard deviation of AUC scores for simulation with correlated
noise and grouped effects.



                    AUC means                          AUC std dev.
   ρ      n = 20  n = 50  n = 100  n = 200     n = 20  n = 50  n = 100  n = 200
 -0.99    0.712   0.712   0.713    0.715       0.139   0.058   0.032    0.016
 -0.75    0.702   0.710   0.713    0.714       0.078   0.030   0.014    0.008
 -0.50    0.705   0.710   0.711    0.714       0.076   0.032   0.015    0.008
 -0.25    0.706   0.705   0.713    0.714       0.078   0.032   0.015    0.007
  0.00    0.696   0.709   0.714    0.712       0.077   0.030   0.015    0.007
  0.25    0.698   0.712   0.710    0.714       0.078   0.034   0.016    0.008
  0.50    0.703   0.713   0.714    0.714       0.081   0.030   0.016    0.007
  0.75    0.698   0.712   0.711    0.714       0.088   0.034   0.016    0.008
  0.99    0.741   0.721   0.712    0.710       0.195   0.091   0.053    0.024

Table 2: Mean and standard deviation of AUC scores for simulation with correlated
noise and randomised effects.

5.3. Prediction in a large simulated problem. Here we present the analysis of a single
simulated dataset, demonstrating how our approach performs when the centroid
classifier is used. We take $p = 10{,}000$ and $n = 100$ (50 for each class). Ten percent
of the variables (in the model at (3.1)) include a nonzero signal. These signals are
drawn from the uniform distribution on [0, 0.35], and the noise variables are
independent N(0, 1). This is a particularly difficult problem, since the signals are very
weak compared to the noise, and sample size is quite small.

Figure 2: Ideal prediction success for the example of section 5.3.

Figure 2 shows the prediction performance of the centroid classifier on a test set of 
1,000 replicates in the case of "ideal variable selection," where the 1,000 variables with 
nonzero signals are selected first, in decreasing order of signal strength, followed by the 
9,000 variables where the signal is not present. In particular, the order is not chosen 
empirically. The minimum of the graph occurs at 449 variables (out of a maximum 
of 1,000), and corresponds to a misclassification rate of only 0.5%. The decrease in 
predictive performance caused by less useful, or redundant, variables is apparent from 
the figure; the weaker genuine variables actually hurt prediction performance because 
they contain more noise than signal. Also of note is the fact that a large number of 
variables is needed to obtain good prediction. For example, if attention is confined
only to the strongest 50 variables then the misclassification rate increases to 16%.
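To make the role of model size concrete, the following sketch evaluates a centroid classifier restricted to a chosen set of features. The text does not spell out the classifier, so this assumes the standard nearest-centroid rule with Euclidean distance, with ties assigned to class 0. The function names are hypothetical.

```python
def fit_centroids(X, y, features):
    """Class centroids (0 and 1) computed on the chosen feature indices."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in features]
    return centroids


def predict(x, centroids, features):
    """Assign x to the class with the nearer centroid (ties go to class 0)."""
    def dist2(label):
        return sum((x[j] - c) ** 2 for j, c in zip(features, centroids[label]))
    return 1 if dist2(1) < dist2(0) else 0


def error_rate(X_train, y_train, X_test, y_test, features):
    """Test-set misclassification rate using only the given features."""
    centroids = fit_centroids(X_train, y_train, features)
    wrong = sum(1 for x, lab in zip(X_test, y_test)
                if predict(x, centroids, features) != lab)
    return wrong / len(y_test)


# Feature 0 separates the classes; feature 1 is pure noise.
X_train = [[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]]
y_train = [0, 0, 1, 1]
X_test = [[1.9, 5.0], [0.2, -3.0]]
y_test = [1, 0]
err_signal_only = error_rate(X_train, y_train, X_test, y_test, [0])
err_noise_only = error_rate(X_train, y_train, X_test, y_test, [1])
```

Calling `error_rate` with `features = ranking[:k]` for increasing `k` traces out a curve of the kind shown in Figure 2, given a ranking of the variables.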




[Plot: horizontal axis, false positive rate.]

Figure 3: ROC plot for feature ranking in the example of section 5.3.

For the same dataset we undertook variable ranking based on the values of $\hat\ell_j$,
defined at (2.3). The result was the ROC chart in Figure 3. There the value of $k$ in
the ranking $\hat j_1, \ldots, \hat j_k, \hat j_{k+1}, \ldots, \hat j_p$, defined in the sentence below (2.3), is represented
as $k/p$ on the horizontal axis, and the vertical axis depicts the value of a
ratio of two negative numbers. The area under the empirical curve, i.e. AUC, equals
0.626, meaning that a fraction $1 - 0.626 \approx 37\%$ of the paired scores correspond to a
misranking. The ROC curve is indexed by model size, with bottom left denoting an
empty model and the top right a full model. For a given model size, we can read off
the chart the corresponding sensitivity, or proportion of true variables included, and
the false positive rate, or proportion of redundant variables in the model. Ideally a
model should have high sensitivity and low false positive rate, and the chart indicates
the tradeoff between the two for various model sizes.

Figure 4 shows how prediction accuracy varies with model size. Performance is
now clearly a long way from that represented in Figure 2, where the 1,000 variables
with nonzero signals were listed first in decreasing order of strength. The minimum
misclassification rate is now 13.7%, and requires the use of 3,258 of the 10,000 features. As
discussed in the previous paragraph, every model size corresponds to a position on the
ROC plot in Figure 3, in this case (0.31, 0.51). Hence the optimal model found here
contains 51% of the genuine variables, and 31% of the redundant ones.

Figure 4: Performance using variable ranking and centroid classifier for the example
of section 5.3.



We next explore stage (3) of the four-stage algorithm suggested in section 2.1,
addressing in turn each of the approaches (3a)-(3c) discussed in section 2.3. We implement
the threshold method, (3a), by comparison with a randomised model where the
observed classes $I_i$ are scrambled and likelihoods recalculated, and employ the approach
suggested in the second paragraph of section 4; see (4.3). In particular, the threshold
is chosen by computing the $100\alpha$th percentile of scores for the scrambled data. Doing
this for $\alpha = 0.2$ corresponds to seeking a false positive rate of 0.2. In the numerical
example that we are considering here, this recovers a model with 2,159 predictors and
produces a test set misclassification rate of 16.8%. A model this size corresponds to
the point (0.196, 0.40) on the ROC chart. Notice that we have effectively targeted the
false positive rate of $\alpha = 0.2$ via this approach.
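The scrambling idea can be sketched in a few lines. This is an illustration rather than the procedure actually used above: the per-feature score is taken to be minus the squared difference of class means (so that, like the likelihood score, smaller values indicate stronger features), a single random permutation of the labels is used, and all function names are hypothetical.

```python
import random


def neg_sq_mean_diff(X, y):
    """Stand-in score per feature: -(mean in class 1 - mean in class 0)^2."""
    n1 = sum(y)
    n0 = len(y) - n1
    scores = []
    for j in range(len(X[0])):
        m1 = sum(x[j] for x, lab in zip(X, y) if lab == 1) / n1
        m0 = sum(x[j] for x, lab in zip(X, y) if lab == 0) / n0
        scores.append(-(m1 - m0) ** 2)
    return scores


def scrambled_threshold(X, y, alpha, rng):
    """100*alpha-th percentile of the scores after scrambling the labels."""
    y_perm = list(y)
    rng.shuffle(y_perm)
    null_scores = sorted(neg_sq_mean_diff(X, y_perm))
    return null_scores[int(alpha * len(null_scores))]


def select_features(X, y, alpha, rng):
    """Indices whose observed score falls below the scrambled threshold."""
    t = scrambled_threshold(X, y, alpha, rng)
    return [j for j, s in enumerate(neg_sq_mean_diff(X, y)) if s < t]


# Synthetic check: the first p1 of p features carry a shift mu in class 1.
rng = random.Random(3)
p, p1, mu = 200, 20, 1.5
y = [1] * 20 + [0] * 20
X = [[(mu if (lab == 1 and j < p1) else 0.0) + rng.gauss(0.0, 1.0)
      for j in range(p)] for lab in y]
selected = select_features(X, y, 0.2, rng)
true_pos = sum(1 for j in selected if j < p1)
false_pos_rate = sum(1 for j in selected if j >= p1) / (p - p1)
```

In this synthetic example the realised false positive rate among the null features should be in the neighbourhood of the target $\alpha = 0.2$, mirroring the observation made in the text.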

To provide an example of the change-point method, (3b), suggested in section 2.3
for choosing model size, we consider the ratio of the sorted likelihoods from the original
and scrambled rankings. These are plotted in Figure 5, along with a 45° line. Starting
with the weakest variables, we expect the ratio to remain near 1 until a sizeable number
of variables that genuinely contain a positive signal cause the ratio to shrink. For this
purpose we can use the simple change-point statistic for detecting a change in the mean
(see Chapter 2 of Csörgő and Horváth, 1997),

$$T(t) = n^{-1/2}\,\{S(nt) - t\,S(n)\}\,,$$

where $S(k)$ equals the cumulative sum of the first $k$ ratios, and $t \in (0, 1)$ denotes the
examined proportion of the dataset. This leads to a model with 1,760 variables and
a misclassification rate of 16.2%, comparable to that when using method (3a). This
model size corresponds to the point (0.16, 0.36) on the ROC plot.

Figure 5: Comparison of actual and randomised log-likelihoods in the example of
section 5.3.
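The statistic $T(t) = n^{-1/2}\{S(nt) - tS(n)\}$ is easy to evaluate on a grid $t = k/n$. The sketch below computes it for every $k$ and, as an assumption (the text does not state the decision rule), takes the change point to be the $k$ at which $|T(k/n)|$ is largest. The function name `cusum_change_point` is hypothetical.

```python
import math


def cusum_change_point(ratios):
    """Return (k_hat, T_hat): the index k maximising |T(k/n)| and the value
    of T there, where T(k/n) = n^{-1/2} {S(k) - (k/n) S(n)}."""
    n = len(ratios)
    total = sum(ratios)
    best_k, best_T, running = 0, 0.0, 0.0
    for k in range(1, n):
        running += ratios[k - 1]                  # S(k)
        T = (running - (k / n) * total) / math.sqrt(n)
        if abs(T) > abs(best_T):
            best_k, best_T = k, T
    return best_k, best_T


# Ratios stay at 1 for the first 60 positions, then drop to 0.5.
ratios = [1.0] * 60 + [0.5] * 40
k_hat, T_hat = cusum_change_point(ratios)
```

For this noiseless sequence the maximum is attained exactly at the point where the mean of the ratios changes.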

Finally, in reference to the classifier-based approach (3c) suggested in section 2.3, 
we note that the apparent error rate can be driven quickly to zero without the actual 
error rate being reduced as much as it is if we employ methods (3a) or (3b). For 
example, when using (3c) in conjunction with the centroid classifier the "best" model, 
with apparent error rate equal to zero, occurs when just 39 variables are selected; but 
the misclassification rate on the test set is 32.5%, almost twice that obtained for either 
of methods (3a) and (3b). 

5.4. Results for real data examples. A challenge when using our methodology to analyse 
previously considered real datasets is that the latter were possibly considered because 
they illustrate cases where only a very small number of variables determine the class 
label. In particular, contrary to the concerns raised by Goldstein (2009), the number of 
influential components is quite small. To simplify matters, we demonstrate here that 
likelihood based ranking is a powerful tool for improving a wide variety of classifiers. 
We make use of three well-known sets of microarray data. These relate respectively 
to leukemia (Golub et al., 1999), colon cancer (Alon et al., 1999) and prostate cancer
(Singh et al., 2002) and have 7,129, 2,000 and 6,033 components respectively. Dettling
(2004) and Donoho and Jin (2008) discuss the performance of a variety of classifiers
on these datasets, using a two-thirds/one-third split of the data into training and test
samples. Their results are reported in Table 3. Readers may refer to the above papers
for details on specific methods.



 Method    Leukemia   Colon    Prostate
 Bagboo      4.08     16.10      7.53
 Boost       5.67     19.14      8.71
 RanFor      1.92     14.86      9.00
 SVM         1.83     15.05      7.88
 DLDA        2.92     12.86     14.18
 KNN         3.83     16.38     10.59
 PAM         3.55     13.53      8.87
 HCT         2.86     13.77      9.47

Table 3: Percentage misclassification rate of different methods on microarray datasets.

To test the effectiveness of likelihood-based ranking we chose the best classifica- 
tion method and the random forest classifier (a consistent performer) for each of the 
datasets. An extra step was added to each cross-validation fold; the two-thirds training 
data was used to rank variables based on the likelihood score, and then only a pro- 
portion of the top-ranked variables were used to estimate the final model. The results 
are presented in Table 4. The last row of the table shows results for the full dataset; 
they should in theory match those in Table 3, with differences attributable to tuning 
approaches. We could not reproduce the accuracy reported for DLDA on the colon 
dataset, and so used the next best method (PAM). 
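The essential point of the extra step is that the ranking is recomputed inside each training split, so the held-out third never influences which variables are kept. A minimal sketch of that workflow follows. It is not the authors' code: the likelihood score is replaced by the absolute difference of class means, the final model is a nearest-centroid classifier, and the names `rank_features`, `centroid_error` and `split_error` are hypothetical.

```python
import random


def rank_features(X, y):
    """Feature indices ordered by decreasing |difference of class means|
    (a stand-in for the likelihood-based ranking)."""
    n1 = sum(y)
    n0 = len(y) - n1
    def score(j):
        m1 = sum(x[j] for x, lab in zip(X, y) if lab == 1) / n1
        m0 = sum(x[j] for x, lab in zip(X, y) if lab == 0) / n0
        return abs(m1 - m0)
    return sorted(range(len(X[0])), key=score, reverse=True)


def centroid_error(Xtr, ytr, Xte, yte, feats):
    """Test error of a nearest-centroid rule using only `feats`."""
    cents = {}
    for lab in (0, 1):
        rows = [x for x, l in zip(Xtr, ytr) if l == lab]
        cents[lab] = [sum(r[j] for r in rows) / len(rows) for j in feats]
    wrong = 0
    for x, lab in zip(Xte, yte):
        d = {l: sum((x[j] - c) ** 2 for j, c in zip(feats, cents[l])) for l in (0, 1)}
        wrong += (1 if d[1] < d[0] else 0) != lab
    return wrong / len(yte)


def split_error(X, y, prop, n_splits, rng):
    """Average test error over random two-thirds/one-third splits, ranking
    features on the training part only and keeping the top `prop` of them."""
    n, p = len(X), len(X[0])
    errors = []
    for _ in range(n_splits):
        idx = list(range(n))
        rng.shuffle(idx)
        cut = (2 * n) // 3
        tr, te = idx[:cut], idx[cut:]
        Xtr, ytr = [X[i] for i in tr], [y[i] for i in tr]
        Xte, yte = [X[i] for i in te], [y[i] for i in te]
        feats = rank_features(Xtr, ytr)[:max(1, int(prop * p))]
        errors.append(centroid_error(Xtr, ytr, Xte, yte, feats))
    return sum(errors) / len(errors)


rng = random.Random(7)
y = [1] * 30 + [0] * 30
X = [[(2.0 if (lab == 1 and j < 5) else 0.0) + rng.gauss(0.0, 1.0)
      for j in range(50)] for lab in y]
err_top10pct = split_error(X, y, prop=0.1, n_splits=5, rng=rng)
```

Ranking on the full dataset before splitting would leak information from the test third into the selection step and would tend to understate the error.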

In each case accuracy can be improved by reducing the model size. For the best 
classifiers on each dataset, this effect was small but noticeable; for the leukemia data, 
dimension was reduced by 25% and error by 5%; for the colon dataset, dimension 
was reduced by 62.5% and error by 1%; and for the prostate dataset, dimension was 
reduced by 62.5% and error by 3%. For the random forest models the results were even 
more pronounced, with marked improvement in prediction and significant dimension
reduction. For the prostate dataset, the error was reduced by 25%, using just 0.005 
of the available variables in each fold. This suggests that the likelihood based ranking 
method can effectively control the sparsity of a model and potentially improve model 
performance. 

            Leukemia            Colon              Prostate
 Prop.     SVM    RanFor      RDA    RanFor               RanFor
 0.0025   34.72    4.58      14.29   16.58       8.82      8
 0.005    34.61    3.78      13.00   15.71       8.37      7.67
 0.01      6.86    3.36      13.13   15.77       7.84      7.73
 0.025     1.89    2.97      13.10   15.74       7.16      8.20
 0.05      1.67    2.58      13.26   15.16       7.29      8.55
 0.10      1.58    2.50      13.19   14.94       7.02      8.90
 0.15      1.67    2.22      13.16   14.94       7.10      9.14
 0.25      1.64    2.36      12.90   15.03       7.06      9.49
 0.375     1.61    2.11      12.87   15.52       6.82      9.67
 0.50      1.58    2.22      13.00   15.58       6.82      9.90
 0.75      1.53    2.36      12.97   16.13       7.02      9.80
 1.00      1.61    2.31      13.00   16.32       7.04     10.24

Table 4: Performance of best methods on reduced datasets, using likelihood-based
ranking.

While firm conclusions are difficult here, we argue that this analysis presents evidence
for a large number of relatively weak effects contributing to a model. Indeed, in
all but one case we would prefer a model size larger than the dozens, or fewer, used in
many conventional approaches to variable selection. Furthermore, our variable ranking
appears to be a useful means of determining the effective model size.



6 TECHNICAL ARGUMENTS 

6.1. Proof of Theorem 1. Define $\ell_{ij}(\alpha, \beta)$ and $\ell_j(\alpha, \beta)$ as at (2.1) and (2.2), respectively,
and put $\ell(\alpha, \beta) = E\{\ell_{ij}(\alpha, \beta)\}$. To simplify notation, and since for the most part we
shall work with one $j$ at a time, we omit mention of $j$ in the notation $\ell(\alpha, \beta)$, and
likewise we drop the subscript $j$ on $\mu_j$.

The event that $X_i$ comes from $\Pi_1$ occurs with probability $\pi$ if $X_i$ is sampled from
the union of $\Pi_0$ and $\Pi_1$. Thus, conditional on the sampling operation, $X_{ij} = \mu + Z_{ij}$
with probability $\pi$, and $X_{ij} = Z_{ij}$ with probability $1 - \pi$, where each $Z_{ij}$ is distributed
as $Z$. Therefore,

$$\ell(\alpha,\beta) = -\pi\,(\alpha + \beta\mu) + \pi\,E[\log\{1 + \exp(\alpha + \beta\mu + \beta Z)\}] + (1 - \pi)\,E[\log\{1 + \exp(\alpha + \beta Z)\}]\,,$$

$$d_{10}(\alpha,\beta) = \frac{\partial}{\partial\alpha}\,\ell(\alpha,\beta) = 1 - \pi - \pi\,E[\{1 + \exp(\alpha + \beta\mu + \beta Z)\}^{-1}] - (1 - \pi)\,E[\{1 + \exp(\alpha + \beta Z)\}^{-1}]\,, \tag{6.1}$$

$$d_{01}(\alpha,\beta) = \frac{\partial}{\partial\beta}\,\ell(\alpha,\beta) = -\pi\,E[(\mu + Z)\,\{1 + \exp(\alpha + \beta\mu + \beta Z)\}^{-1}] - (1 - \pi)\,E[Z\,\{1 + \exp(\alpha + \beta Z)\}^{-1}]\,, \tag{6.2}$$

where we have used the fact that $E(Z) = 0$.

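Let $\rho = e^\alpha/(1 + e^\alpha)$. Since $E|Z|^4 < \infty$ then Taylor expansion shows that, uniformly
in $|\alpha|, |\mu| \le C$ for a fixed constant $C > 0$, and as $\beta \to 0$,

$$E[\{1 + \exp(\alpha + \beta\mu + \beta Z)\}^{-1}] = (1 + e^\alpha)^{-1}\,E\big(\big[1 + \rho\,\{\beta(\mu + Z) + \tfrac12\beta^2(\mu + Z)^2 + \ldots\}\big]^{-1}\big)$$
$$= (1 - \rho)\,E\big[1 - \rho\,\{\beta(\mu + Z) + \tfrac12\beta^2(\mu + Z)^2\} + \rho^2\beta^2(\mu + Z)^2\big] + O(|\beta|^3)$$
$$= (1 - \rho)\,\{1 - \rho\beta\mu + \rho\,(\rho - \tfrac12)\,\beta^2(\mu^2 + EZ^2)\} + O(|\beta|^3)\,, \tag{6.3}$$

$$E[(\mu + Z)\,\{1 + \exp(\alpha + \beta\mu + \beta Z)\}^{-1}] = (1 - \rho)\,E\big((\mu + Z)\,\big[1 - \rho\,\{\beta(\mu + Z) + \tfrac12\beta^2(\mu + Z)^2\} + \rho^2\beta^2(\mu + Z)^2\big]\big) + O(|\beta|^3)$$
$$= (1 - \rho)\,\{\mu - \rho\beta\,(\mu^2 + EZ^2) + \rho\,(\rho - \tfrac12)\,\beta^2(\mu^3 + 3\mu\,EZ^2 + EZ^3)\} + O(|\beta|^3)\,. \tag{6.4}$$

Combining (6.1)-(6.4) we deduce that, uniformly in $|\alpha|, |\mu| \le C$ and as $\beta \to 0$,

$$d_{10}(\alpha,\beta) = 1 - \pi - (1 - \rho)\,\{1 - \rho\beta\pi\mu + \rho\,(\rho - \tfrac12)\,\beta^2(\pi\mu^2 + EZ^2)\} + O(|\beta|^3)\,, \tag{6.5}$$

$$d_{01}(\alpha,\beta) = -(1 - \rho)\,\{\pi\mu - \rho\beta\,(\pi\mu^2 + EZ^2) + \rho\,(\rho - \tfrac12)\,\beta^2(\pi\mu^3 + 3\pi\mu\,EZ^2 + EZ^3)\} + O(|\beta|^3)\,. \tag{6.6}$$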
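The second-order expansion at (6.3), namely $E[\{1+\exp(\alpha+\beta\mu+\beta Z)\}^{-1}] = (1-\rho)\{1 - \rho\beta\mu + \rho(\rho-\tfrac12)\beta^2(\mu^2+EZ^2)\} + O(|\beta|^3)$ with $\rho = e^\alpha/(1+e^\alpha)$, can be checked numerically. The sketch below is an illustrative sanity check rather than part of the proof. It assumes a two-point noise distribution, $Z = \pm 1$ with probability $\tfrac12$ each, so that the expectation is available exactly; the function names are hypothetical.

```python
import math


def exact_expectation(alpha, beta, mu):
    """E[{1 + exp(alpha + beta*mu + beta*Z)}^{-1}] for Z = +1 or -1 with
    probability 1/2 each, so that E(Z) = 0 and E(Z^2) = 1."""
    return 0.5 * sum(1.0 / (1.0 + math.exp(alpha + beta * (mu + z)))
                     for z in (-1.0, 1.0))


def second_order(alpha, beta, mu, ez2=1.0):
    """Right-hand side of the expansion, without the O(|beta|^3) remainder."""
    rho = math.exp(alpha) / (1.0 + math.exp(alpha))
    return (1.0 - rho) * (1.0 - rho * beta * mu
                          + rho * (rho - 0.5) * beta ** 2 * (mu ** 2 + ez2))


alpha, mu = 0.3, 0.5
err_coarse = abs(exact_expectation(alpha, 0.1, mu) - second_order(alpha, 0.1, mu))
err_fine = abs(exact_expectation(alpha, 0.05, mu) - second_order(alpha, 0.05, mu))
```

Halving $\beta$ should reduce the error by a factor close to $2^3 = 8$, which is the behaviour expected of a remainder of order $|\beta|^3$.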
Put $\rho = \pi + \eta$. By Taylor expansion from (6.5) and (6.6),

$$d_{10}(\alpha,\beta) = \eta + \pi\,(1 - \pi)\,\{\beta\pi\mu - (\pi - \tfrac12)\,\beta^2 EZ^2\} + O(\delta^3)\,, \tag{6.7}$$

$$d_{01}(\alpha,\beta) = -(1 - \pi - \eta)\,\{\pi\mu - (\pi + \eta)\,\beta\,EZ^2 + \pi\,(\pi - \tfrac12)\,\beta^2 EZ^3\} + O(\delta^3)$$
$$= -\pi\,(1 - \pi)\,\mu + \eta\,\pi\,\mu + \{\pi\,(1 - \pi) + (1 - 2\pi)\,\eta\}\,\beta\,E(Z^2) - \pi\,(1 - \pi)\,(\pi - \tfrac12)\,\beta^2 E(Z^3) + O(\delta^3)\,, \tag{6.8}$$

uniformly in $|\beta| \le \delta$ and $\alpha$ such that $|\rho - \pi| \le \delta$, as $\delta \to 0$.

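Recall that $\lambda = (n^{-1}\log n)^{1/2}$. By Taylor expansion from formula (2.2) for $\ell_j(\alpha, \beta)$,
we have:

$$\frac{\partial}{\partial\alpha}\,(1 - E)\,\ell_j(\alpha,\beta) = \frac1n\sum_{i=1}^n (1 - E)\,(1 - I_i) - \frac1n\sum_{i=1}^n (1 - E)\,\{1 + \exp(\alpha + \beta X_{ij})\}^{-1}$$
$$= \frac1n\sum_{i=1}^n (1 - E)\,(1 - I_i) + \rho\,(1 - \rho)\,\beta\,\frac1n\sum_{i=1}^n (1 - E)\,X_{ij} + O_p(\delta^2\lambda) \tag{6.9}$$
$$= \frac1n\sum_{i=1}^n (1 - E)\,(1 - I_i) + \pi\,(1 - \pi)\,\beta\,\frac1n\sum_{i=1}^n (1 - E)\,X_{ij} + O_p(\delta^2\lambda)\,, \tag{6.10}$$

$$\frac{\partial}{\partial\beta}\,(1 - E)\,\ell_j(\alpha,\beta) = \frac1n\sum_{i=1}^n (1 - E)\,(1 - I_i)\,X_{ij} - \frac1n\sum_{i=1}^n (1 - E)\,X_{ij}\,\{1 + \exp(\alpha + \beta X_{ij})\}^{-1}$$
$$= \frac1n\sum_{i=1}^n (1 - E)\,(1 - I_i)\,X_{ij} - (1 - \rho)\,\frac1n\sum_{i=1}^n (1 - E)\,X_{ij} + O_p(\delta\lambda) \tag{6.11}$$
$$= \frac1n\sum_{i=1}^n (1 - E)\,(1 - I_i)\,X_{ij} - (1 - \pi)\,\frac1n\sum_{i=1}^n (1 - E)\,X_{ij} + O_p(\delta\lambda)\,, \tag{6.12}$$

uniformly in $1 \le j \le p$, in $|\alpha| \le C$ such that $|\rho - \pi| \le \delta$, and in $|\beta| \le \delta$, as $\delta \to 0$. Here
we continue to take $\rho = e^\alpha/(e^\alpha + 1)$ and $\eta = \rho - \pi$. Deriving (6.10) and (6.12) for each
fixed $j$, $\alpha$ and $\beta$ is straightforward. In the Appendix we shall show that (6.10) and
(6.12) hold uniformly in those quantities, and more particularly that the remainders
$O_p(\delta^2\lambda)$ and $O_p(\delta\lambda)$ there can be written as $\Theta_j\,\delta^2\lambda$ and $\Theta_j\,\delta\lambda$, respectively, where in
both cases $\Theta_j$ satisfies (3.4).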

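Define

$$\xi_j^{(1)} = \frac1n\sum_{i=1}^n (1 - E)\,X_{ij}\,, \qquad \xi_j^{(2)} = \frac1n\sum_{i=1}^n (1 - E)\,(1 - I_i)\,X_{ij}\,,$$

and put

$$\zeta = -n^{-1}\sum_{i=1}^n (1 - E)\,I_i = n^{-1}\sum_{i=1}^n (1 - E)\,(a - I_i) \tag{6.13}$$

for any constant $a$. Note that $(\partial/\partial\alpha)\,\ell_j(\alpha, \beta) = d_{10}(\alpha, \beta) + (\partial/\partial\alpha)\,(1 - E)\,\ell_j(\alpha, \beta)$,
and that a similar formula applies in the case of $(\partial/\partial\beta)\,\ell_j(\alpha, \beta)$. Combining these
properties with (6.7), (6.8), (6.10) and (6.12) we deduce that

$$\frac{\partial}{\partial\alpha}\,\ell_j(\alpha,\beta) = \eta + \zeta + \pi\,(1 - \pi)\,\beta\,\xi_j^{(1)} + \pi\,(1 - \pi)\,\{\beta\pi\mu - (\pi - \tfrac12)\,\beta^2 EZ^2\} + O_p(\delta^3)\,, \tag{6.14}$$

$$\frac{\partial}{\partial\beta}\,\ell_j(\alpha,\beta) = \xi_j^{(2)} - (1 - \pi - \eta)\,\xi_j^{(1)} - \pi\,(1 - \pi)\,\mu + \pi\,(1 - \pi)\,\beta\,E(Z^2) + O_p(\delta^2)\,, \tag{6.15}$$

uniformly in $1 \le j \le p$, $|\alpha| \le C$ and $|\beta|, |\mu| \le \delta$, where we assume $\delta \ge \lambda$. The fact that
the $\xi_j^{(k)}$ and $\zeta$ equal $O_p(\lambda)$ uniformly in $1 \le j \le p$ can be proved as in the Appendix.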

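Equating the right-hand side of (6.14) to zero, and solving for $\eta$, we deduce that

$$\eta = \hat\eta_j = -\big[\zeta + \pi\,(1 - \pi)\,\beta\,\xi_j^{(1)} + \pi\,(1 - \pi)\,\{\beta\pi\mu - (\pi - \tfrac12)\,\beta^2 EZ^2\}\big] + O_p(\delta^3)\,, \tag{6.16}$$

uniformly in $1 \le j \le p$ and solutions $(\alpha, \beta)$ of the equations

$$\frac{\partial}{\partial\alpha}\,\ell_j(\alpha,\beta) = 0\,, \qquad \frac{\partial}{\partial\beta}\,\ell_j(\alpha,\beta) = 0 \tag{6.17}$$

that satisfy $|\alpha| \le C$ and $|\beta| \le \delta$, where it is assumed that $|\mu| \le \delta$ and $\delta \ge \lambda$. Write
$(\hat\alpha_j, \hat\beta_j)$ for any such solution.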

Substituting the expression (6.16) for $\eta$ into the right-hand side of (6.15), and
equating to zero, we deduce that

$$\xi_j^{(2)} - (1 - \pi + \zeta)\,\xi_j^{(1)} - \pi\,(1 - \pi)\,\mu + \pi\,(1 - \pi)\,\beta\,E(Z^2) - \pi\,(1 - \pi)\,(\pi - \tfrac12)\,\beta^2 E(Z^3) = O_p(\delta^2)\,, \tag{6.18}$$

uniformly in $1 \le j \le p$ and solutions $(\alpha, \beta)$ of (6.17) for which $|\alpha| \le C$ and $|\beta| \le \delta$.
Solving (6.18) for $\beta$ we obtain:

$$\beta\,\pi\,(1 - \pi)\,(EZ^2)\,\{1 + O_p(\delta)\} = (1 - \pi)\,\xi_j^{(1)} - \xi_j^{(2)} + \pi\,(1 - \pi)\,\mu + O_p(\delta^2)\,,$$

from which it follows that, uniformly in the sense of:

all $1 \le j \le p$ and all $(\hat\alpha_j, \hat\beta_j)$ satisfying (6.17), $|\hat\alpha_j| \le C$ and $|\hat\beta_j| \le \delta$, (6.19)

we have:

$$\hat\beta_j = \frac{(1 - \pi)\,\xi_j^{(1)} - \xi_j^{(2)} + \pi\,(1 - \pi)\,\mu}{\pi\,(1 - \pi)\,EZ^2} + O_p(\delta^2)\,. \tag{6.20}$$

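Recall that $\alpha_0 = \log\{\pi/(1 - \pi)\}$. Uniformly in the sense of (6.19),

$$\hat\alpha_j - \alpha_0 = \log\Big(\frac{\hat\rho_j}{1 - \hat\rho_j}\Big) - \log\Big(\frac{\pi}{1 - \pi}\Big) = \frac{\hat\eta_j}{\pi\,(1 - \pi)} + O_p(\hat\eta_j^2)\,,$$

where $\hat\rho_j = \exp(\hat\alpha_j)/\{1 + \exp(\hat\alpha_j)\}$ and $\hat\eta_j$ is given by (6.16). Therefore, in view of
(6.16) and (6.20),

$$\hat\alpha_j - \alpha_0 = \frac{\hat\eta_j}{\pi\,(1 - \pi)} + O_p(\hat\eta_j^2) = -\frac{\zeta}{\pi\,(1 - \pi)} + O_p(\delta^2)\,, \tag{6.21}$$

uniformly in the sense of (6.19).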

By (6.5) and (6.6), $d_{10}(\alpha_0, 0) = 0$ and $d_{01}(\alpha_0, 0) = -\pi\,(1 - \pi)\,\mu$. It can be deduced
from (6.1) and (6.2) that

/ exp(q + /3/x + /3Z) | 

f exp(q + /3Z) \ 
- (1 - 7r) ni + exp( a + /3Z)/' 
, / m p // . 7 x 2 exp(a + /3/z + /3Z) \ 



+ exp(a + (3 Z) J 

Therefore, $d_{20}(\alpha_0, 0) = -\pi$, $d_{02}(\alpha_0, 0) = -\pi^2\mu^2 - \pi\,E(Z^2)$ and $d_{11}(\alpha_0, 0) = -\pi^2\mu$.

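Combining (6.21) and the results in the previous paragraph, and using Taylor
expansion, we deduce that:

$$\ell(\hat\alpha_j, \hat\beta_j) - \ell(\alpha_0, 0) = (\hat\alpha_j - \alpha_0)\,d_{10}(\alpha_0, 0) + \hat\beta_j\,d_{01}(\alpha_0, 0) + \tfrac12\,(\hat\alpha_j - \alpha_0)^2\,d_{20}(\alpha_0, 0)$$
$$\qquad + \tfrac12\,\hat\beta_j^2\,d_{02}(\alpha_0, 0) + (\hat\alpha_j - \alpha_0)\,\hat\beta_j\,d_{11}(\alpha_0, 0) + O_p(|\hat\alpha_j - \alpha_0|^3 + |\hat\beta_j|^3)$$
$$= -\pi\,(1 - \pi)\,\mu\,\hat\beta_j - \tfrac12\,\zeta^2\,\{\pi\,(1 - \pi)^2\}^{-1} - \tfrac12\,\hat\beta_j^2\,\pi\,EZ^2 + O_p(\delta^3)\,, \tag{6.22}$$

uniformly in the sense of (6.19). Define $R_1 = \ell(\alpha_0, 0) - \tfrac12\,\zeta^2\,\{\pi\,(1 - \pi)^2\}^{-1}$ and
$S_{1j} = \{(1 - \pi)\,\xi_j^{(1)} - \xi_j^{(2)}\}\,\{\pi\,(1 - \pi)\,EZ^2\}^{-1}$. (Here and below, $R_1$, $R_2$ and $R_3$ will denote
quantities not depending on $j$, and in particular do not depend on $\mu = \mu_j$.) In this
notation it follows from (6.20) that

$$\hat\beta_j = S_{1j} + \mu\,(EZ^2)^{-1} + O_p(\delta^2)\,, \tag{6.23}$$

uniformly in the sense of (6.19), and then from (6.22) that

$$\ell(\hat\alpha_j, \hat\beta_j) = R_1 - \pi\,(1 - \pi)\,\Big(S_{1j} + \frac{\mu}{EZ^2}\Big)\,\mu - \tfrac12\,\pi\,\Big(S_{1j} + \frac{\mu}{EZ^2}\Big)^2 EZ^2 + O_p(\delta^3)$$
$$= R_1 - S_{1j}\,\mu\,\{\pi\,(1 - \pi) + \pi\} - \mu^2\,(EZ^2)^{-1}\,\{\pi\,(1 - \pi) + \tfrac12\,\pi\} - \tfrac12\,\pi\,EZ^2\,S_{1j}^2 + O_p(\delta^3)\,. \tag{6.24}$$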



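It can be deduced from (2.1) and (2.2), by Taylor expansion, that $D_j(\alpha, \beta) =
(1 - E)\,\ell_j(\alpha, \beta)$ satisfies

$$D_j(\alpha,\beta) = \frac1n\sum_{i=1}^n (1 - E)\,\ell_{ij}(\alpha,\beta)$$
$$= \beta\,\frac1n\sum_{i=1}^n (1 - E)\,(\pi - I_i)\,X_{ij} - \alpha\,\frac1n\sum_{i=1}^n (1 - E)\,I_i + O_p(\delta^3)$$
$$= \beta\,\{(\pi - 1)\,\xi_j^{(1)} + \xi_j^{(2)}\} + \alpha\,\zeta + O_p(\delta^3)\,, \tag{6.25}$$

uniformly in $1 \le j \le p$, $|\alpha| \le C$ and $|\beta| \le \delta$, where $\zeta$ is as at (6.13). Replacing $(\alpha, \beta)$
here by $(\hat\alpha_j, \hat\beta_j)$, and noting from (6.21) that $\hat\alpha_j - \alpha_0 = -\zeta\,\{\pi\,(1 - \pi)\}^{-1} + O_p(\delta^2)$ and
also that $\hat\beta_j$ satisfies (6.23); and defining $R_2 = [\alpha_0 - \zeta\,\{\pi\,(1 - \pi)\}^{-1}]\,\zeta$; we deduce from
(6.25) that

$$D_j(\hat\alpha_j, \hat\beta_j) = S_{1j}\,S_{2j} + \mu\,(EZ^2)^{-1}\,S_{2j} + R_2 + O_p(\delta^3)\,, \tag{6.26}$$

uniformly in the sense of (6.19), where

$$S_{2j} = (\pi - 1)\,\xi_j^{(1)} + \xi_j^{(2)} = -\pi\,(1 - \pi)\,(EZ^2)\,S_{1j}\,.$$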

Combining (6.24) and (6.26), and observing that $\ell_j(\alpha, \beta) = \ell(\alpha, \beta) + D_j(\alpha, \beta)$, we find
that

$$\ell_j(\hat\alpha_j, \hat\beta_j) = R_3 + S_{1j}\,S_{2j} + \mu\,(EZ^2)^{-1}\,S_{2j} - \mu\,S_{1j}\,\pi\,(2 - \pi) - \mu^2\,(EZ^2)^{-1}\,\tfrac12\,\pi\,(3 - 2\pi) - \tfrac12\,\pi\,(EZ^2)\,S_{1j}^2 + O_p(\delta^3)$$
$$= R_3 - \pi\,(1 - \pi)\,(EZ^2)\,S_{1j}^2 - \tfrac12\,\pi\,(EZ^2)\,S_{1j}^2 - \pi\,(1 - \pi)\,\mu\,S_{1j} - \pi\,(2 - \pi)\,\mu\,S_{1j} - \tfrac12\,\pi\,(3 - 2\pi)\,(EZ^2)^{-1}\,\mu^2 + O_p(\delta^3)$$
$$= R_3 - \tfrac12\,\pi\,(3 - 2\pi)\,(EZ^2)\,S_{1j}^2 - \pi\,(3 - 2\pi)\,\mu\,S_{1j} - \tfrac12\,\pi\,(3 - 2\pi)\,(EZ^2)^{-1}\,\mu^2 + O_p(\delta^3)$$
$$= R_3 - \tfrac12\,\pi\,(3 - 2\pi)\,(EZ^2)\,\Big(S_{1j} + \frac{\mu}{EZ^2}\Big)^2 + O_p(\delta^3)\,, \tag{6.27}$$

where $R_3 = R_1 + R_2$ and does not depend on $j$.

Since we assumed that $(\hat\alpha_j, \hat\beta_j)$ minimises $\ell_j(\alpha, \beta)$ within $n^{-c}$ of $(\alpha_0, 0)$, where
$0 < c < 1$, then we may take $\delta = n^{-c}$. Then, iterating (6.27), we conclude that (6.27)
holds with $\delta = \lambda$. In that case (6.27) is identical to (3.3).

6.2. Proofs of Theorems 2 and 3. For brevity we treat only pairs $j_1, j_2$ such that $\mu_{j_1} > 0$
and $\mu_{j_2} = 0$. Assume that the conditions of either Theorem 2 or Theorem 3 hold. Put
$W_j = n^{1/2} S_j$ and $\delta_j = n^{1/2} \mu_j$. Then

$$P(\hat\ell_{j_2} < \hat\ell_{j_1}) = P\{(W_{j_1} - W_{j_2} + \delta_{j_1})\,(W_{j_1} + W_{j_2} + \delta_{j_1}) < \Theta_{j_1 j_2}\,n^{-1/2}\,(\log n)^{3/2}\}\,, \tag{6.28}$$

where, by (3.3), the random variable $\Theta_{j_1 j_2}$ satisfies

$$\lim_{C\to\infty}\,\limsup_{n\to\infty}\,P\Big(\max_{j_1\,:\,\mu_{j_1}>0}\;\max_{j_2\,:\,\mu_{j_2}=0}\,|\Theta_{j_1 j_2}| > C\Big) = 0 \tag{6.29}$$

for a constant $C > 0$. Also, defining $\sigma^2 = \pi\,(1 - \pi)\,EZ^2$ to denote the asymptotic
variance of $n^{1/2} S_j$, we have:

$$\mathrm{cov}(W_{j_1} + W_{j_2},\, W_{j_1} - W_{j_2}) = E(W_{j_1}^2) - E(W_{j_2}^2) = \sigma^2 - \sigma^2 + o(1) \to 0\,.$$

From this property, the definitions of $W_{j_1}$ and $W_{j_2}$, and the Berry-Esseen bound for
sums of independent random vectors, we find that

$$P(W_{j_1} + W_{j_2} \le x_1,\; W_{j_1} - W_{j_2} \le x_2) = \Phi(x_1/\sigma_{j_1 j_2,+})\;\Phi(x_2/\sigma_{j_1 j_2,-}) + o(1)\,, \tag{6.30}$$

uniformly in real numbers $x_1$, $x_2$ and in $j_1$, $j_2$ such that $\mu_{j_1} > 0$ and $\mu_{j_2} = 0$, as $n \to \infty$.
Combining (6.28)-(6.30) we deduce that if $\mu_{j_1} = n^{-1/2}\,c_{n j_1}$ then

$$P(\hat\ell_{j_2} < \hat\ell_{j_1}) = \big\{\Phi(-c_{n j_1}/\sigma_{j_1 j_2,+})\;\Phi(c_{n j_1}/\sigma_{j_1 j_2,-}) + \Phi(c_{n j_1}/\sigma_{j_1 j_2,+})\;\Phi(-c_{n j_1}/\sigma_{j_1 j_2,-})\big\} + o(1)\,, \tag{6.31}$$

uniformly in $j_1$, $j_2$ such that $\mu_{j_1} > 0$ and $\mu_{j_2} = 0$ as $n \to \infty$. Theorems 2 and 3 follow
from (6.31). Note that in the context of Theorem 2, $c_{nj} \to 0$ uniformly in $1 \le j \le p$.

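6.3. Proof of Theorem 4. Result (6.28) continues to hold, with $\delta_j = n^{1/2} \mu_j =
c\,(\log n)^{1/2}$ when $\mu_j > 0$. In place of (6.29) we use the following result, a consequence
of (3.4): for a constant $C > 0$,

$$\max_{j_1\,:\,\mu_{j_1}>0}\;\max_{j_2\,:\,\mu_{j_2}=0}\,P(|\Theta_{j_1 j_2}| > C) = O(n^{-B_4})\,,$$

where $B_4 = \tfrac12\,\{B_2 - 2\max(B_1 + 3,\, 2B_1)\} - \epsilon$ for any $\epsilon > 0$. Therefore, since
$n^{-1/6} \times n^{-1/6} > n^{-1/2}$,

$$P(\hat\ell_{j_2} < \hat\ell_{j_1}) = P(W_{j_1} + W_{j_2} + \delta_{j_1} < -\tfrac12\,n^{-1/6},\; W_{j_1} - W_{j_2} + \delta_{j_1} > \tfrac12\,n^{-1/6})$$
$$\qquad + P(W_{j_1} + W_{j_2} + \delta_{j_1} > \tfrac12\,n^{-1/6},\; W_{j_1} - W_{j_2} + \delta_{j_1} < -\tfrac12\,n^{-1/6}) + \theta_{j_1 j_2} + O(n^{-B_4})\,, \tag{6.32}$$

uniformly in $j_1$, $j_2$ such that $\mu_{j_1} > 0$ and $\mu_{j_2} = 0$, where

$$|\theta_{j_1 j_2}| \le \psi_{j_1 j_2} = P(W_{j_1} - W_{j_2} + \delta_{j_1} \in [-\tfrac12\,n^{-1/6},\, \tfrac12\,n^{-1/6}]) + P(W_{j_1} + W_{j_2} + \delta_{j_1} \in [-\tfrac12\,n^{-1/6},\, \tfrac12\,n^{-1/6}])\,. \tag{6.33}$$

Given a random variable $V$, write $(1 - E)\,V$ to denote $V - E(V)$. Define $V_j =
\{\pi\,(1 - \pi)\}^{-1}\,n^{-1/2}\sum_i (I_i - \pi)\,Z_{ij}$, which has zero mean, and note that $\pi\,(1 - \pi)\,(W_j -
V_j) = c\,(\log n)^{1/2}\,\Delta_j$, where, defining $I_{ij}$ to be the indicator of the event that $\mu_j > 0$
(conditional on $X_i$ being from $\Pi_1$, or equivalently on $I_i = 1$), we put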



i=l i=l 

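Bernstein's inequality can therefore be used to prove that for all $B_5, B_6 > 0$,

$$\sum_{j=1}^p P(|\Delta_j| > B_5\,n^{-1/6}) = O(n^{-B_6})\,.$$

It follows that (6.32) and (6.33) continue to hold if we replace $W_j$ by $V_j$, and $\tfrac12\,n^{-1/6}$
by $n^{-1/6}$, at each appearance:

$$P(\hat\ell_{j_2} < \hat\ell_{j_1}) = P(V_{j_1} + V_{j_2} + \delta_{j_1} < -n^{-1/6},\; V_{j_1} - V_{j_2} + \delta_{j_1} > n^{-1/6})$$
$$\qquad + P(V_{j_1} + V_{j_2} + \delta_{j_1} > n^{-1/6},\; V_{j_1} - V_{j_2} + \delta_{j_1} < -n^{-1/6}) + \vartheta_{j_1 j_2} + O(n^{-B_4})\,, \tag{6.34}$$

uniformly in $j_1$, $j_2$ such that $\mu_{j_1} > 0$ and $\mu_{j_2} = 0$, where

$$|\vartheta_{j_1 j_2}| \le \phi_{j_1 j_2} = P(V_{j_1} - V_{j_2} + \delta_{j_1} \in [-n^{-1/6},\, n^{-1/6}]) + P(V_{j_1} + V_{j_2} + \delta_{j_1} \in [-n^{-1/6},\, n^{-1/6}])\,. \tag{6.35}$$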

Property (3.8) implies that the random variables (I± — tt) satisfy a uniform 
pairwise Cramer continuity condition: (3.8) holds if Zj x and Zj 2 there are replaced 
by [l\ — 7r) Zij 1 and (I± — n) Z^ 2 , respectively. This property, and standard methods 
for deriving Edgeworth expansions of distributions of random vectors (Bhattacharya 
and Rao, 1976), then allow us to prove that, uniformly in j ± , j 2 such that /i^ > and 
fij 2 = 0, and in real numbers x± and x 2 , 

P(V n + V j2 < x x , V n - V j2 < x 2 ) = P(N jlj2j+ < x u N hj2t _ < x 2 ) 

fc 

+ Y,n- k/2 P k (x 1 ,x 2 ) ( f ) (x 1 ,x 2 ) + 0(n- B2 ) , (6.36) 
k=i 

where $(N_{j_1 j_2,+}, N_{j_1 j_2,-})$ denotes a normally distributed random vector having zero mean
and the same covariance matrix as $(V_{j_1} + V_{j_2}, V_{j_1} - V_{j_2})$, $\phi$ is the density of the
distribution of $(N_{j_1 j_2,+}, N_{j_1 j_2,-})$, the quantities $P_k$ are polynomials of degree $3k - 1$ with the
same parity as $k + 1$ and uniformly bounded coefficients (for $k \le k_0$), and $k_0$ equals
the largest integer strictly less than $B_2$ (recall that in Theorem 1, and hence also in
Theorem 4, we assumed that $E|Z|^{B_2} < \infty$). Here we have used the fact that, in view
of (3.8),

$E(V_{j_1} \pm V_{j_2})^2$ is bounded away from zero uniformly in values of $j_1, j_2$
satisfying $1 \le j_1, j_2 \le p$ and $j_1 \ne j_2$, and for both choices of the $\pm$ signs. (6.37)

Now, $E\{(V_{j_1} + V_{j_2})(V_{j_1} - V_{j_2})\} = E(V_{j_1}^2) - E(V_{j_2}^2) = 0$, and so $N_{j_1 j_2,+}$ and $N_{j_1 j_2,-}$ are
independent. Therefore (6.36) implies that, uniformly in the sense there,

$$P(V_{j_1} + V_{j_2} < x_1,\ V_{j_1} - V_{j_2} < x_2) = P(N_{j_1 j_2,+} < x_1)\, P(N_{j_1 j_2,-} < x_2) + \sum_{k=1}^{k_0} n^{-k/2}\, P_k(x_1, x_2)\, \phi_+(x_1)\, \phi_-(x_2) + O(n^{-B_2}), \tag{6.38}$$

where $\phi_\pm$ denotes the density of $N_{j_1 j_2,\pm}$ for respective values of the signs.
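The step from zero covariance to independence rests on the identity $\mathrm{cov}(V_1 + V_2, V_1 - V_2) = \mathrm{var}(V_1) - \mathrm{var}(V_2)$, which vanishes whenever the two variances agree, whatever the correlation between $V_1$ and $V_2$. The following is a minimal numerical sketch of that identity only. It assumes unit-variance normal variables with a chosen correlation `rho`; these are hypothetical stand-ins for $V_{j_1}$ and $V_{j_2}$, and the function name is ours, not the paper's.

```python
import random


def sum_diff_covariance(rho, n_draws=50000, seed=0):
    """Sample covariance of (V1 + V2, V1 - V2).

    V1 and V2 are unit-variance normals with correlation rho; they are
    illustrative stand-ins, not the statistics V_j defined in the text.
    """
    rng = random.Random(seed)
    sum_s = sum_d = sum_sd = 0.0
    for _ in range(n_draws):
        a = rng.gauss(0.0, 1.0)
        b = rng.gauss(0.0, 1.0)
        v1 = a
        # Var(v2) = rho^2 + (1 - rho^2) = 1 and Corr(v1, v2) = rho.
        v2 = rho * a + (1.0 - rho ** 2) ** 0.5 * b
        s, d = v1 + v2, v1 - v2
        sum_s += s
        sum_d += d
        sum_sd += s * d
    return sum_sd / n_draws - (sum_s / n_draws) * (sum_d / n_draws)


if __name__ == "__main__":
    for rho in (0.0, 0.5, 0.9):
        print(rho, sum_diff_covariance(rho))
```

The printed covariances should be close to zero for every `rho`. Zero covariance yields independence here only because the limiting pair $(N_{j_1 j_2,+}, N_{j_1 j_2,-})$ is jointly normal.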

Put $\kappa_n = (\log n)^{1/2}$ and recall that $\sigma_{j_1 j_2,\pm}$ is defined at (3.6). Using (6.37) and
(6.38) it can be deduced that

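$$\begin{aligned}
\sum_{j_1 : \mu_{j_1} > 0}\ \sum_{j_2 : \mu_{j_2} = 0} & P\big(V_{j_1} + V_{j_2} + \delta_{j_1} < -n^{-1/6},\ V_{j_1} - V_{j_2} + \delta_{j_1} > n^{-1/6}\big) \\
&\sim \sum_{j_1 : \mu_{j_1} > 0}\ \sum_{j_2 : \mu_{j_2} = 0} P\big(N_{j_1 j_2,+} < -n^{-1/6} - c\,\kappa_n\big)\, P\big(N_{j_1 j_2,-} > n^{-1/6} - c\,\kappa_n\big) \\
&\sim \sum_{j_1 : \mu_{j_1} > 0}\ \sum_{j_2 : \mu_{j_2} = 0} P\big(N_{j_1 j_2,+} < -n^{-1/6} - c\,\kappa_n\big) \\
&\sim \sum_{j_1 : \mu_{j_1} > 0}\ \sum_{j_2 : \mu_{j_2} = 0} \Phi\big(-c\,\kappa_n / \sigma_{j_1 j_2,+}\big),
\end{aligned} \tag{6.39}$$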

and similarly, 

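$$\sum_{j_1 : \mu_{j_1} > 0}\ \sum_{j_2 : \mu_{j_2} = 0} P\big(V_{j_1} + V_{j_2} + \delta_{j_1} > -n^{-1/6},\ V_{j_1} - V_{j_2} + \delta_{j_1} < n^{-1/6}\big) \sim \sum_{j_1 : \mu_{j_1} > 0}\ \sum_{j_2 : \mu_{j_2} = 0} \Phi\big(-c\,\kappa_n / \sigma_{j_1 j_2,-}\big), \tag{6.40}$$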

and that both of these quantities are of a strictly larger order of magnitude than

$$\sum_{j_1 : \mu_{j_1} > 0}\ \sum_{j_2 : \mu_{j_2} = 0} P\big(V_{j_1} \pm V_{j_2} + \delta_{j_1} \in [-n^{-1/6}, n^{-1/6}]\big),$$

for either choice of the $\pm$ sign. Write (P) to denote the latter property. (Note that
if $N$ is a standard normal random variable then $P(N > t) \sim P(N > t + \delta_t)$ as
$t \to \infty$, for any quantity $\delta_t$ that satisfies $t\,\delta_t \to 0$, and that if in addition $\delta_t > 0$,
$P(N \in [t - \delta_t, t + \delta_t]) \sim (2/\pi)^{1/2}\, \delta_t \exp(-t^2/2)$.) Combining (P), (6.34), (6.35), (6.39)
and (6.40), and noting that by assumption $p_1 (p - p_1) = o(n^{B_4})$, we deduce that (3.9)
holds.
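The two normal-tail facts quoted in the parenthetical note can be checked numerically. The sketch below uses only the standard-library complementary error function; the function names and the particular values of `t` and `delta` are illustrative choices of ours.

```python
import math


def upper_tail(t):
    """P(N > t) for a standard normal N, via the complementary error function."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def tail_shift_ratio(t, delta):
    """P(N > t + delta) / P(N > t); close to 1 when t * delta is small."""
    return upper_tail(t + delta) / upper_tail(t)


def interval_ratio(t, delta):
    """P(N in [t - delta, t + delta]) divided by the approximation
    (2 / pi)^{1/2} * delta * exp(-t^2 / 2)."""
    exact = upper_tail(t - delta) - upper_tail(t + delta)
    approx = math.sqrt(2.0 / math.pi) * delta * math.exp(-0.5 * t * t)
    return exact / approx


if __name__ == "__main__":
    for t, delta in ((3.0, 1e-3), (6.0, 1e-4)):
        print(t, delta, tail_shift_ratio(t, delta), interval_ratio(t, delta))
```

Both ratios should be close to 1 whenever the product `t * delta` is small, which is the regime the note requires.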

6.4. Proof of Theorem 5. Result (4.4), with the remainder term satisfying (4.5), will
follow from Theorem 1 if we show that for a constant $B_7 > 0$,

v 

E P{\U 3 2 - U j2 \ > B 7 A 3 ) = 0(n- B ±) . (6.41) 

3=1 

Bernstein's inequality can be used to prove that, for any given $B_8 > 0$, we can choose
$B_9 > 0$, depending on $B_8$, such that

$$P(|\hat\pi - \pi| > B_9 \lambda) = O(n^{-B_8}). \tag{6.42}$$
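To see how a bound of the form (6.42) links $B_9$ to $B_8$, the following sketch evaluates a tail bound explicitly. It rests on three assumptions that are not confirmed by this section: that $\lambda = (n^{-1} \log n)^{1/2}$, that $\hat\pi$ is the mean of $n$ independent indicator variables, and that Hoeffding's inequality is used in place of Bernstein's (it is simpler to state and suffices for bounded variables). The function names are hypothetical.

```python
import math


def lam(n):
    """Assumed rate lambda = (log n / n)^{1/2}; not stated in this section."""
    return math.sqrt(math.log(n) / n)


def hoeffding_tail_bound(n, b9):
    """Hoeffding bound 2 * exp(-2 * n * t^2) on P(|pi_hat - pi| > t), at t = b9 * lambda.

    Assumes pi_hat is the mean of n independent indicators.
    """
    t = b9 * lam(n)
    return 2.0 * math.exp(-2.0 * n * t * t)


def implied_b8(b9):
    """The bound above equals 2 * n^{-2 * b9^2}, so B_8 = 2 * b9^2 is attainable."""
    return 2.0 * b9 * b9


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, hoeffding_tail_bound(n, 1.5), 2.0 * n ** (-implied_b8(1.5)))
```

Under these assumptions, any target $B_8$ is reached by taking $B_9 = (B_8/2)^{1/2}$, which matches the qualitative statement that $B_9$ can be chosen depending on $B_8$.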
Put $\bar Z_j = n^{-1} \sum_i Z_{ij}$ and $\overline{Z^2}_j = n^{-1} \sum_i (Z_{ij}^2 - E Z_{ij}^2)$. Then, since $E|Z|^{B_2} < \infty$, we
have for constants $B_{10}, B_{11} > 0$ and all $\varepsilon > 0$:

J2 {P(\Zj\ > B w A) + P(\Si\ > B w A)} = o(^i+-(s 2 -D/2) ) (6 43) 

3=1 

P{\Z^\ > Bu A) = 0(n 1+ ^ B ^) . (6.44) 
(The method of proof is that leading to (A.1) in the Appendix.) Define

T 3 = ~Y1 (* - T) ^« = T (1 - T) $ . ^ = " E ( 7 < " *) - ^) » 

n n 
i=i i=i 

and note that 

^ = T i - + 7T ^ + (tt - tt) ^} (tt — 7T) . (6.45) 
Result (6.41) follows from (6.42)-(6.45), the bound $|\mu_j| \le c\lambda$, and the property



7r(3-27r)T/ 7r(3-27r)T/ 
2f 1 /2{7r(l -tt)} 2 ' ^ ~ 2r 1 /2{ vr (l -vr)}^ 



Here we used the fact that, by the definition of $B_4$, $B_4 < \tfrac12 B_2 - 1$.



A APPENDIX: Proof that (6.10) and (6.12) hold uniformly in $1 \le j \le p$, $|\alpha| \le C$ such that $|\rho - \pi| \le \delta$, and $|\beta| \le \delta$

We shall give proofs of (6.11) and (6.12). Derivations of (6.9) and (6.10) are almost
identical, and in fact passing from (6.9) to (6.10), and from (6.11) to (6.12), requires
only the properties $|\rho - \pi| \le \delta$, which is an assumption, and



$$\sup_{1 \le j \le p} \Big| \frac1n \sum_{i=1}^{n} (1 - E)\, X_{ij} \Big| = O_p(\lambda),$$



which follows from standard moderate-deviation results (see Rubin and Sethuraman, 
1965; Amosova, 1972). Indeed, those results imply the following stronger property: for 
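a constant $B_3 > 0$, depending on $B_1$ and $B_2$ in Theorem 1,

$$\sum_{j=1}^{p} P\Big\{ \Big| \frac1n \sum_{i=1}^{n} (1 - E)\, X_{ij} \Big| > B_3 \lambda \Big\} = O\big(n^{B_1 + \varepsilon - (B_2 - 1)/2}\big) \tag{A.1}$$

for each $\varepsilon > 0$.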
The claim that (6.11) holds uniformly in $1 \le j \le p$, in $|\alpha| \le C$ and in $|\beta| \le \delta$ will
follow if we show that, with $\Delta_{ij}(\alpha, \beta) = \{1 + \exp(\alpha + \beta X_{ij})\}^{-1} - (1 - \rho)$, we have:



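$$\sup_{1 \le j \le p,\ |\alpha| \le C,\ |\beta| \le \delta} \Big| \frac1n \sum_{i=1}^{n} (1 - E)\, X_{ij}\, \Delta_{ij}(\alpha, \beta) \Big| = O_p(\delta\lambda), \tag{A.2}$$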



uniformly in the same sense. In fact, it follows from (A.8) and (A.9) below that a
sharper form of (A.2),

$$\sum_{j=1}^{p} P\Big\{ \sup_{|\alpha| \le C,\ |\beta| \le \delta} \Big| \frac1n \sum_{i=1}^{n} (1 - E)\, X_{ij}\, \Delta_{ij}(\alpha, \beta) \Big| > \text{const.}\ \delta\lambda \Big\} = O(n^{-B_4}),$$

where $B_4 = \tfrac12 \{B_2 - 2\max(B_1 + 3, 2B_1)\} - \varepsilon$ for any $\varepsilon > 0$, holds. This stronger bound,
together with (A.1), implies that (6.11) and (6.12) hold in the stronger sense that
the remainder $O_p(\delta\lambda)$ there can be replaced by $\Theta_j\, \delta\lambda$, where $\sum_j P(|\Theta_j| > \text{const.}) =
O(n^{-B_4})$ and $B_4 = \tfrac12 \{B_2 - 2\max(B_1 + 3, 2B_1)\} - \varepsilon$. That result gives the sharper
version, implied by (3.4), of (3.3) in Theorem 1.

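With probability 1,

$$|\Delta_{ij}(\alpha, \beta)| \le K_1 \min(1, |\beta X_{ij}|), \qquad \frac{|\Delta_{ij}(\alpha_1, \beta_1) - \Delta_{ij}(\alpha_2, \beta_2)|}{|\alpha_1 - \alpha_2| + |\beta_1 - \beta_2|} \le K_1 (1 + |X_{ij}|), \tag{A.3}$$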



uniformly in $|\alpha|, |\alpha_1|, |\alpha_2|, |\beta|, |\beta_1|, |\beta_2| \le C$, where, here and below, $K_1, K_2, \ldots$ are
fixed positive constants. Using the first part of (A.3), and standard arguments for
proving moderate-deviation results for sums of independent random variables (Rubin
and Sethuraman, 1965; Amosova, 1972), and assuming that $E|Z|^{2(B+1+\varepsilon)} < \infty$, where
$B, \varepsilon > 0$, it can be shown that

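$$\sup_{1 \le j \le p,\ |\alpha| \le C,\ |\beta| \le \delta} P\Big\{ \Big| \frac1n \sum_{i=1}^{n} (1 - E)\, X_{ij}\, \Delta_{ij}(\alpha, \beta) \Big| > K_2\, \delta\, (B n^{-1} \log n)^{1/2} \Big\} \le K_3\, n^{-B}. \tag{A.4}$$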

Partition $[-C, C]$ and $[-\delta, \delta]$ into lattices of equal edge width $d_n$, where $0 < d_n \le \delta$,
and let $\mathcal{V}_1$ and $\mathcal{V}_2$ denote the respective sets of lattice vertices. The number of vertices
in each lattice equals $O(d_n^{-1})$, and so by (A.4),

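$$\sup_{1 \le j \le p} P\Big\{ \sup_{\alpha \in \mathcal{V}_1,\ \beta \in \mathcal{V}_2} \Big| \frac1n \sum_{i=1}^{n} (1 - E)\, X_{ij}\, \Delta_{ij}(\alpha, \beta) \Big| > K_2\, \delta\, (B n^{-1} \log n)^{1/2} \Big\} \le K_3\, d_n^{-2}\, n^{-B}. \tag{A.5}$$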
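The passage from (A.4) to (A.5) is a union bound over lattice pairs: each lattice has $O(d_n^{-1})$ vertices, so there are $O(d_n^{-2})$ pairs $(\alpha, \beta)$, and every point of the parameter rectangle lies within half an edge width of a vertex in each coordinate. A minimal sketch of that counting follows; the helper names and the numerical values of $C$, $\delta$ and $d_n$ are illustrative assumptions, not quantities fixed by the paper.

```python
import math


def lattice(lo, hi, d):
    """Vertices lo, lo + d, lo + 2d, ... covering [lo, hi] with edge width d.

    The endpoint hi is appended if the last step falls short of it.
    """
    count = int(math.floor((hi - lo) / d + 1e-12)) + 1
    pts = [lo + k * d for k in range(count)]
    if pts[-1] < hi - 1e-12:
        pts.append(hi)
    return pts


def nearest(pts, x):
    """Lattice vertex closest to x (ties broken by list order)."""
    return min(pts, key=lambda v: abs(v - x))


def union_bound(num_alpha, num_beta, single_bound):
    """Bound on the probability that any lattice pair exceeds the level,
    given the same bound single_bound for each individual pair."""
    return num_alpha * num_beta * single_bound


if __name__ == "__main__":
    C, delta, d = 2.0, 0.5, 0.1  # illustrative values only
    v1, v2 = lattice(-C, C, d), lattice(-delta, delta, d)
    print(len(v1), len(v2), union_bound(len(v1), len(v2), 1e-6))
    print(abs(nearest(v1, 0.337) - 0.337) <= d / 2)
```

The product of the two vertex counts is the factor $d_n^{-2}$, up to a constant, that multiplies $K_3 n^{-B}$ in (A.5).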
Given $\alpha_1 \in [-C, C]$ and $\beta_1 \in [-\delta, \delta]$, let $\alpha_2$ and $\beta_2$ be the lattice vertices nearest to
$\alpha_1$ and $\beta_1$, respectively; ties can be broken arbitrarily. In view of the second part of
(A.3), for all $\alpha_1 \in [-C, C]$ and all $\beta_1 \in [-\delta, \delta]$, and with probability 1,

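$$\Big| \frac1n \sum_{i=1}^{n} (1 - E)\, X_{ij}\, \Delta_{ij}(\alpha_1, \beta_1) \Big| \le \Big| \frac1n \sum_{i=1}^{n} (1 - E)\, X_{ij}\, \Delta_{ij}(\alpha_2, \beta_2) \Big| + 4 K_1\, d_n\, \frac1n \sum_{i=1}^{n} (1 + |X_{ij}|)^2. \tag{A.6}$$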

Since $E|Z|^{2(B+1+\varepsilon)} < \infty$ then, by Markov's inequality, for each $K_4 > E(\mu^2 + Z^2) + 1$,

$$\sup_{1 \le j \le p} P\Big\{ n^{-1} \sum_{i=1}^{n} (1 + |X_{ij}|)^2 > K_4 \Big\} \le K_5\, n^{-(B+1+\varepsilon)/2}. \tag{A.7}$$

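Combining (A.5), (A.6) and (A.7) we deduce that

$$P\Big\{ \sup_{1 \le j \le p,\ |\alpha| \le C,\ |\beta| \le \delta} \Big| \frac1n \sum_{i=1}^{n} (1 - E)\, X_{ij}\, \Delta_{ij}(\alpha, \beta) \Big| > K_6\, (\delta\lambda + d_n) \Big\} = O\big\{ p\, \big( d_n^{-2}\, n^{-B} + n^{-(B+1+\varepsilon)/2} \big) \big\}. \tag{A.8}$$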



Take $d_n = \delta\lambda$ and $\delta \ge \lambda$. Then (A.2) follows from (A.8) provided that $p\,\big(d_n^{-2}\, n^{-B} +
n^{-(B+1+\varepsilon)/2}\big) \to 0$. Since $p = O(n^{B_1})$ then it is sufficient that $B > \max(B_1 + 1, 2B_1 - 1)$.
That is, we should ensure that $E|Z|^{B_2} < \infty$ where $B_2 > 2\max(B_1 + 3, 2B_1)$, which
assumption is imposed in the theorem. In this case it follows from (A.8) that



the left-hand side of (A.8) equals $O(n^{-B_4})$ where $B_4 = \tfrac12 \{B_2 - 2\max(B_1 + 3, 2B_1)\} - \varepsilon$. (A.9)
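The exponent in (A.9) is simple arithmetic in $B_1$, $B_2$ and $\varepsilon$, and a small helper makes the moment condition $B_2 > 2\max(B_1 + 3, 2B_1)$ explicit. This is a sketch only; the function names are ours, and exact rational arithmetic is used so that the worked values are not affected by rounding.

```python
from fractions import Fraction


def moment_threshold(b1):
    """2 * max(B_1 + 3, 2 * B_1): the lower bound that B_2 must exceed."""
    b1 = Fraction(b1)
    return 2 * max(b1 + 3, 2 * b1)


def b4(b1, b2, eps):
    """B_4 = (1/2) * {B_2 - 2 * max(B_1 + 3, 2 * B_1)} - eps, as in (A.9)."""
    if Fraction(b2) <= moment_threshold(b1):
        raise ValueError("need B_2 > 2 * max(B_1 + 3, 2 * B_1)")
    return (Fraction(b2) - moment_threshold(b1)) / 2 - Fraction(eps)


if __name__ == "__main__":
    # With B_1 = 1 the threshold is 8; B_2 = 10 then leaves B_4 = 1 - eps.
    print(moment_threshold(1), b4(1, 10, Fraction(1, 10)))
    # With B_1 = 5 the term 2 * B_1 dominates and the threshold is 20.
    print(moment_threshold(5), b4(5, 22, 0))
```

For small $B_1$ the term $B_1 + 3$ determines the threshold, while for $B_1 > 3$ the term $2B_1$ takes over.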